

Home > Archive > PERL CGI Beginners > May 2004 > Output Unicode










Author Output Unicode
Octavian Rasnita

2004-05-22, 11:31 am

Hi all,

Does anyone know how I could print a UTF-8 HTML page (like Google's)?

Which modules do I need to use?

Is Perl able to do that?

Thank you.


Teddy

Wiggins D Anconia

2004-05-22, 11:31 am

> Hi all,
>
> Does anyone know how I could print a UTF-8 HTML page (like Google's)?
>
> Which modules do I need to use?
>
> Is Perl able to do that?
>
> Thank you.
>


You may find some useful help in,

perldoc perluniintro
perldoc perlunicode
perldoc utf8

I suspect it can, just don't know much more about it.

http://danconia.org

Mt M

2004-05-22, 11:31 am

You certainly can.

There's another alias - "perl-unicode@perl.org" to which you can send
unicode related questions.

To set the HTTP header correctly you should use the following (with
CGI.pm's header function):

print header(-charset => 'utf-8');

as opposed to the standard:

print header;

which sets the page encoding to iso-8859-1 (Western European) by default.

The members of the above alias will be able to provide more support.
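
For readers who want to see the header and the body together, here is a minimal sketch, not code from the thread. It writes the Content-Type line by hand, which is meant to match what header(-charset => 'utf-8') declares, and it uses the core Encode module so that the body bytes really are UTF-8. The helper name utf8_response is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): build a whole CGI response
# as a string of bytes.
sub utf8_response {
    my ($text) = @_;
    my $html = "<html><head><title>UTF-8 test</title></head>\n"
             . "<body><p>$text</p></body></html>\n";
    # The header only declares the charset; encode() is what makes the
    # body bytes match that declaration.
    return "Content-Type: text/html; charset=utf-8\r\n\r\n"
         . encode('UTF-8', $html);
}

binmode STDOUT;    # the response is already bytes, so add no extra layer
# Romanian letters written as code points so the script itself stays ASCII.
print utf8_response("Rom\x{E2}n\x{103}: \x{15F} \x{163} \x{103} \x{EE} \x{E2}");
```

The point of the sketch is that the declaration and the encoding are two separate steps: dropping the encode() call would leave the header saying utf-8 while the body might not be.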





Octavian Rasnita

2004-05-22, 11:31 am

I have tried those modules and others like Encode, and they produce UTF-8
strings, but without printing those first 3 special chars that make the
browser and other programs recognize the file as UTF-8.

Thank you anyway.

Teddy


Mt M

2004-05-22, 11:31 am

>I have tried those modules and others like Encode, and they produce UTF-8
>strings, but without printing those first 3 special chars that make the
>browser and other programs recognize the file as UTF-8.


Are you talking about the Byte Order Marks (BOM) ? The browser doesn't need
these to know that the file is UTF-8. - Other apps might.
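
The "first 3 special chars" are most likely the UTF-8 byte order mark: the character U+FEFF, which encodes to the bytes EF BB BF. For the case where some other application does want them, the following sketch shows one way to produce them with the core Encode module. The helper name with_bom is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF (the byte order mark) encoded as UTF-8 is three bytes.
my $bom = encode('UTF-8', "\x{FEFF}");

# Hypothetical helper: UTF-8 encode some text and put the BOM in front.
sub with_bom {
    my ($text) = @_;
    return $bom . encode('UTF-8', $text);
}

# Show the BOM bytes in hex: prints "EF BB BF".
print join(' ', map { sprintf '%02X', ord } split //, $bom), "\n";
```

The BOM is only a marker. Adding it does not convert text that is in some other encoding.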

There are several ways to tell a browser what the encoding of a page is. Use
one of these methods when outputting from your cgi script, then do a
"View->Page Encoding" in the browser, and you'll see it set to UTF-8.

1. print header(-charset => 'utf-8');
-this sets the HTTP header

2. Put this in the <head> tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

3. there's another way that I can't remember....

You only need #1 or #2. If you set both to the same encoding then that's no
problem. But if you set both to different encodings, then #1 wins.
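
Since a mismatch between #1 and #2 is resolved in favour of the HTTP header, one way to avoid the question is to generate both from the same variable. This is a sketch under that assumption: page_with_charset is a hypothetical name, and the header is written by hand rather than through CGI.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: one $charset value feeds the HTTP header, the
# <meta> tag and the actual encoding step, so the three cannot disagree.
sub page_with_charset {
    my ($charset, $text) = @_;
    my $html = <<"HTML";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$charset">
<title>Charset test</title>
</head>
<body><p>$text</p></body>
</html>
HTML
    return "Content-Type: text/html; charset=$charset\r\n\r\n"
         . encode($charset, $html);
}

binmode STDOUT;
print page_with_charset('utf-8', "\x{15F}\x{163}\x{103}");
```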

There's an article on unicode in Perl 5.6.1 at
http://developers.sun.com/dev/gadc/...rl/perl561.html
though I know that Perl 5.8 has made big strides with respect to Unicode.

- and you can also pose questions at the feedback page:
http://developers.sun.com/contact/f...l&category=gadc


Or subscribe to the perl unicode alias I mentioned earlier.


-m





Octavian Rasnita

2004-05-22, 11:31 am

Thank you.

I have tried to set the header of the web page as you described, but I have
seen that the special chars like şţăîâŞŢĂÎ are not recognized correctly,
even though the browser recognizes that the encoding is UTF-8.

However, I have seen that the page returned by Google is viewed correctly,
but their page uses those byte order marks special chars that I don't know
how to print.

If I print only common ASCII chars, I have no problem printing them in UTF8.

Thanks.

I will subscribe to the other mailing list. I hope the address for
subscribing is perl-unicode-subscribe@perl.org

Teddy


Mt M

2004-05-22, 11:31 am


I'd say the problem is that the content of your page is not in fact in
UTF-8. Telling the browser that it is is one thing, but that doesn't make
the content itself UTF-8 encoded.
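
One way to check that claim is to test whether the bytes being sent are well-formed UTF-8 at all. The sketch below assumes the text was produced in ISO-8859-2, a common encoding for Romanian (the thread does not say which encoding was actually used), and is_valid_utf8 is a hypothetical helper built on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may alter its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $s_cedilla = "\x{15F}";                          # Romanian s with cedilla
my $as_latin2 = encode('ISO-8859-2', $s_cedilla);   # one byte: BA
my $as_utf8   = encode('UTF-8',      $s_cedilla);   # two bytes: C5 9F

printf "ISO-8859-2 bytes valid as UTF-8? %s\n",
    is_valid_utf8($as_latin2) ? 'yes' : 'no';
printf "UTF-8 bytes valid as UTF-8?      %s\n",
    is_valid_utf8($as_utf8) ? 'yes' : 'no';
```

A lone BA byte is not a valid UTF-8 sequence, so a page that sends such bytes under a UTF-8 header would be expected to show the affected characters wrongly while plain ASCII text still looks fine.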


Are you sure you can actually create a UTF-8 encoded file?

If you create a web page using Mozilla Composer ( part of the Mozilla
browser bundle - free at mozilla.org), it allows you to save it as UTF-8.
That's what I did with multiling.txt attached. - except I exported it as
text. [It may look like garbage in notepad - but it should be ok when viewed
as UTF-8 in browser]
It contains 3 strings - Japanese, Korean and Hebrew.
[to input ja, ko or he strings - merely copy and paste them from a website
in that language]

Also attached is a simple script to read in the text and output it to the
browser.
Run this script in a browser - and it should output strings in ja, ko and
he.
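
The attached script is not preserved in this text-only archive. As a rough idea of what such a script might look like, here is a minimal sketch. The helper name page_from_file is hypothetical, and only the file name multiling.txt comes from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-8 text file and return a complete
# CGI response (header plus HTML) as UTF-8 bytes.
sub page_from_file {
    my ($path) = @_;
    # The :encoding layer decodes the file's bytes into Perl characters.
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;

    my $out = "Content-Type: text/html; charset=utf-8\r\n\r\n"
            . "<html><body>\n";
    for my $line (@lines) {
        # Minimal HTML escaping so the text cannot break the markup.
        $line =~ s/&/&amp;/g;
        $line =~ s/</&lt;/g;
        $line =~ s/>/&gt;/g;
        $out .= "<p>$line</p>\n";
    }
    $out .= "</body></html>\n";
    utf8::encode($out);    # turn the characters back into UTF-8 bytes
    return $out;
}

my $file = 'multiling.txt';    # file name taken from the post above
if (-e $file) {
    binmode STDOUT;
    print page_from_file($file);
}
```

Decoding on input and encoding once on output keeps the script from mixing byte strings and character strings. Mixing the two is a common cause of garbled output, though nothing in the thread confirms it as the cause here.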

[ To make sure you've got the correct font support for viewing languages
encoded in utf-8, visit the UTF-8 sampler page at
http://www.columbia.edu/kermit/utf8.html ]

Other ways of outputting utf-8 characters in perl are available in the
article at http://developers.sun.com/dev/gadc/...rl/perl561.html -
see the 'source' links.


Octavian Rasnita

2004-05-22, 11:31 am

Hi,

Thank you for these examples.
I have tried the program, but it printed the following result on Internet
Explorer 6:

Reading and displaying a file with UTF-8 encoded multilingual text.
Japanese string:
?????? | ????| ???? | ?????- |
???? | ??

Korean:
?? ??? ?? ? ???. ??? ??? ???

Hebrew
??? ???? ????? ?????? ??? ?? ???? ??.


It seems that something's wrong because Internet Explorer automatically
chooses UTF-8 encoding, but it doesn't display the text correctly.
In fact, I don't know what the problem is because I read the text from the
screen using a screen reader (I am blind) but I can read other UTF encoded
pages like Google's page, without problems.

Thank you.

T



Wc -Sx- Jones

2004-05-22, 11:31 am

Octavian Rasnita wrote:

> It seems that something's wrong because Internet Explorer automatically
> chooses UTF-8 encoding, but it doesn't display the text correctly.
> In fact, I don't know what the problem is because I read the text from the
> screen using a screen reader (I am blind) but I can read other UTF encoded
> pages like Google's page, without problems.
>


Try this - if it works in Mozilla 1.6 and doesn't work in IE, then
IE is broken. If it doesn't work in Mozilla, then the creation method
is broken.

I always get multi-byte characters decoded and properly displayed in
Mozilla, unless the generation method was invalid to start with.

HTH/Sx

Octavian Rasnita

2004-05-22, 11:31 am

Oh thanks, this is helpful.
I can see that it is very complicated to use Unicode standards.

I have seen that on that page I can read the text in the Romanian language,
but even though I can read some chars well, I am not able to read other
special chars and I just get question marks instead.

I know that I might need to install some fonts in order to be able to read
them correctly, but it might be a problem with the UTF encoding of that
page, because as I said, I am able to read Google's page without problems.

Teddy

----- Original Message -----
From: "mt m" <molmon@hotmail.com>
To: <orasnita@fcc.ro>; <beginners-cgi@perl.org>
Sent: Thursday, April 01, 2004 9:23 PM
Subject: Re: Output Unicode


> I think it's your font support.
>
> Go to http://www.columbia.edu/kermit/utf8.html
>
> This multilingual page has strings in many languages - all UTF-8 encoded.
>
> If your browser can't render text for a specific language on this page,
> then the problem is your font support.
>


Mt M

2004-05-22, 11:31 am


>Oh thanks, this is helpful.
>I can see that it is very complicated to use Unicode standards.


Well, it can be. But if you've got a new browser (Mozilla 1.6) and a
reasonably new OS (Solaris 9, XP or JDS), then you should be fine for
viewing UTF-8 encoded pages in most languages.


>
>I have seen that on that page I can read the text in the Romanian language,
>but even though I can read some chars well, I am not able to read other
>special chars and I just get question marks instead.
>I know that I might need to install some fonts in order to be able to read
>them correctly,


yes. 9 times out of 10, the question mark problem is indicative of a font
issue - not an encoding one.

>but it might be a problem with the UTF encoding of that
>page,


no!

>because as I said, I am able to read Google's page without problems.


Google doesn't always use UTF-8. For example, if you use Netscape 4.7x (no
one should use it, but it's out there...) and fetch http://www.google.com,
it'll return the page iso-8859-1 encoded.
If you fetch http://google.co.jp it'll return it encoded as Shift_JIS (or
some other native Japanese encoding), etc. That is, Google recognises that older
browsers don't really support UTF-8 well - so they send content in native
encodings instead.
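
A CGI script can do something similar in a much simpler way. The sketch below is not a description of what Google actually does. It assumes the browser sends an Accept-Charset header, ignores q-values, and falls back to iso-8859-1 with numeric character references. Both helper names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: choose an output charset from the value of the
# Accept-Charset request header. Simplified: q-values are ignored.
sub pick_charset {
    my ($accept) = @_;
    return 'utf-8' if !defined $accept || $accept eq '';
    return 'utf-8' if $accept =~ /(?:\A|[\s,])(?:utf-8|\*)(?:[\s,;]|\z)/i;
    return 'iso-8859-1';
}

# Hypothetical helper: encode text for the chosen charset. Characters the
# charset cannot represent become numeric references such as &#259;.
sub body_bytes {
    my ($text, $charset) = @_;
    my $copy = $text;    # encode() with a CHECK value may alter its input
    return encode($charset, $copy, Encode::FB_HTMLCREF());
}

# In a CGI script the header value arrives in this environment variable.
my $charset = pick_charset($ENV{HTTP_ACCEPT_CHARSET});
binmode STDOUT;
print "Content-Type: text/html; charset=$charset\r\n\r\n";
print body_bytes("<p>\x{15F} \x{163} \x{103}</p>\n", $charset);
```

The numeric references let an iso-8859-1 page still carry Romanian letters, at the cost of larger output.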



Octavian Rasnita

2004-05-22, 11:31 am

Yes, but I get Google's page with Internet Explorer 6 and I can see that the
page uses UTF-8. And I can see the page fine.
But that example page, also read with IE6, is not read correctly.

T.



Mt M

2004-05-22, 11:31 am

Which Google URL are you accessing?




Octavian Rasnita

2004-05-22, 11:31 am

I am accessing www.google.com which redirects to www.google.ro, or
www.google.com/ncr

T



Wc -Sx- Jones

2004-05-22, 11:31 am

Octavian Rasnita wrote:
> I am accessing www.google.com which redirects to www.google.ro, or
> www.google.com/ncr


That is an auto-handshake between your browser and google.

It means you've properly set-up what Language(s) you want
first and google is trying to be helpful.

(In a previous note I said that the example page
was likely created wrong or you are missing a
MIME handshake that IE6 wants...)

Does the page display correctly in Moz 1.6?

Does the page in question use MIME Content bodies
or is it an XML-Stylesheet?

At any rate this is likely OT for a CGI group.

My Apache 2.x WWW server knows about these:

# Danish (da) - Dutch (nl) - English (en) - Estonian (et)
# French (fr) - German (de) - Gr-Modern (el)
# Italian (it) - Norwegian (no) - Norwegian Nynorsk (nn) - Korean (ko)
# Portugese (pt) - Luxembourgeois* (ltz)
# Spanish (es) - Swedish (sv) - Catalan (ca) - Czech(cs)
# Polish (pl) - Brazilian Portuguese (pt-br) - Japanese (ja)
# Russian (ru) - Croatian (hr)
#
AddLanguage da .dk
AddLanguage nl .nl
AddLanguage en .en
AddLanguage et .et
AddLanguage fr .fr
AddLanguage de .de
AddLanguage he .he
AddLanguage el .el
AddLanguage it .it
AddLanguage ja .ja
AddLanguage pl .po
AddLanguage ko .ko
AddLanguage pt .pt
AddLanguage nn .nn
AddLanguage no .no
AddLanguage pt-br .pt-br
AddLanguage ltz .ltz
AddLanguage ca .ca
AddLanguage es .es
AddLanguage sv .sv
AddLanguage cs .cz .cs
AddLanguage ru .ru
AddLanguage zh-CN .zh-cn
AddLanguage zh-TW .zh-tw
AddLanguage hr .hr

#
# LanguagePriority allows you to give precedence to some languages
# in case of a tie during content negotiation.
#
# Just list the languages in decreasing order of preference. We have
# more or less alphabetized them here. You probably want to change this.
#
LanguagePriority en da nl et fr de el it ja ko no pl pt pt-br ltz ca es sv tw

#
# ForceLanguagePriority allows you to serve a result page rather than
# MULTIPLE CHOICES (Prefer) [in case of a tie] or NOT ACCEPTABLE (Fallback)
# [in case no accepted languages matched the available variants]
#
ForceLanguagePriority Prefer Fallback

#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset ISO-8859-1

#
# Commonly used filename extensions to character sets. You probably
# want to avoid clashes with the language extensions, unless you
# are good at carefully testing your setup after each change.
# See http://www.iana.org/assignments/character-sets for the
# official list of charset names and their respective RFCs.
#
AddCharset ISO-8859-1 .iso8859-1 .latin1
AddCharset ISO-8859-2 .iso8859-2 .latin2 .cen
AddCharset ISO-8859-3 .iso8859-3 .latin3
AddCharset ISO-8859-4 .iso8859-4 .latin4
AddCharset ISO-8859-5 .iso8859-5 .latin5 .cyr .iso-ru
AddCharset ISO-8859-6 .iso8859-6 .latin6 .arb
AddCharset ISO-8859-7 .iso8859-7 .latin7 .grk
AddCharset ISO-8859-8 .iso8859-8 .latin8 .heb
AddCharset ISO-8859-9 .iso8859-9 .latin9 .trk
AddCharset ISO-2022-JP .iso2022-jp .jis
AddCharset ISO-2022-KR .iso2022-kr .kis
AddCharset ISO-2022-CN .iso2022-cn .cis
AddCharset Big5 .Big5 .big5
# For russian, more than one charset is used (depends on client, mostly):
AddCharset WINDOWS-1251 .cp-1251 .win-1251
AddCharset CP866 .cp866
AddCharset KOI8-r .koi8-r .koi8-ru
AddCharset KOI8-ru .koi8-uk .ua
AddCharset ISO-10646-UCS-2 .ucs2
AddCharset ISO-10646-UCS-4 .ucs4
AddCharset UTF-8 .utf8

# The set below does not map to a specific (iso) standard
# but works on a fairly wide range of browsers. Note that
# capitalization actually matters (it should not, but it
# does for some browsers).
#
# See http://www.iana.org/assignments/character-sets
# for a list of sorts. But browsers support few.
#
AddCharset GB2312 .gb2312 .gb
AddCharset utf-7 .utf7
AddCharset utf-8 .utf8
AddCharset big5 .big5 .b5
AddCharset EUC-TW .euc-tw
AddCharset EUC-JP .euc-jp
AddCharset EUC-KR .euc-kr
AddCharset shift_jis .sjis


And, if I choose, I can send out language specific
content using this type of file:

# <Directory "/usr/local/apache2/error">
# AllowOverride None
# Options IncludesNoExec
# AddOutputFilter Includes html
# AddHandler type-map var
# Order allow,deny
# Allow from all
# LanguagePriority en cs de es fr it nl sv pt-br ro
# ForceLanguagePriority Prefer Fallback
# </Directory>

On a directory by directory basis (this
example is from "errors" as I don't serve
anything but generic American English.)


My advice? Find out what Google is
doing so you can mimic it...

-Sx-


PS -
When I go to either URL you listed, it forces me back to:

google.com