Home > Archive > PERL Beginners > June 2007 > still working with utf8









still working with utf8
Tom Allison

2007-06-22, 7:59 am

OK, I sorted out what the deal is with charsets, Encode, utf8 and
other goodies.

Now I have something where I'm not sure exactly how it is supposed to
operate.

I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.

After I decode_base64 them and decode('iso-2022-jp', $text) them
I can print out something that looks exactly like Japanese characters.

But you can't match /(\w+)/ on them. It's apparently one "word"
without spaces in it.
Um... I don't know Japanese. But I guess this string of spaghetti
(to me) is actually a language where one character, as displayed in
a Unicode terminal, is actually one 'word' according to the Perl
definition of a word...

In English, this would pick apart words in a sense that is simple for
me and many on this list to understand.

I guess my question is, for CJK languages, should I expect the notion
of using a regex like \w+ to pick up entire strings of text instead
of discrete words, as in Latin-based languages?
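
A minimal sketch of the steps described in this post, for anyone who wants to reproduce the behaviour. It assumes the encoded word was wrapped by the mail archive, so the stray space and line break have been removed from the payload. Note that Encode::decode takes the encoding name first and the octets second. The variable names are illustrative only.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# Base64 payload of the encoded word from the post, with the stray
# space and line break removed (assumed to be wrapping artifacts).
my $payload = 'Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC';

# Step 1: Base64 text -> raw ISO-2022-JP octets.
my $octets = decode_base64($payload);

# Step 2: octets -> Perl character string.
my $text = decode('iso-2022-jp', $octets);

# On a character string, \w matches letters of any script, so each
# capture is one unbroken run of word characters.
my @words = $text =~ /(\w+)/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d chars: %s\n", length($_), $_ for @words;
```

Run against this subject line, the match should yield two captures: the ASCII prefix "FW" and a single run of 13 Japanese characters. That is the "one word without spaces" effect the post describes.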


Tom Phoenix

2007-06-22, 7:59 am

On 6/21/07, Tom Allison <tom@tacocat.net> wrote:

> I guess my question is, for CJK languages, should I expect the notion
> of using a regex like \w+ to pick up entire strings of text instead
> of discrete words, as in Latin-based languages?


Once you've enabled what the perlunicode manpage calls "Character
Semantics", it says:

Character classes in regular expressions match characters instead
of bytes and match against the character properties specified in
the Unicode properties database. "\w" can be used to match a
Japanese ideograph, for instance.

http://perldoc.perl.org/perlunicode.html

Does that manpage get you any closer to a solution? Hope this helps!
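
To illustrate the passage quoted from perlunicode, here is a small sketch that classifies each character of a string by Unicode script property. The sample text and the script_of helper are hypothetical and are not taken from the thread; the text is written with \x{...} escapes so the source stays plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical sample: two kanji, hiragana, the Latin word "Perl",
# then hiragana, katakana, and more hiragana.
my $text = "\x{6771}\x{4EAC}\x{3067}Perl\x{306E}"
         . "\x{30C6}\x{30B9}\x{30C8}\x{3092}\x{3059}\x{308B}";

# Name the script of a single character using Unicode properties.
sub script_of {
    my ($ch) = @_;
    return 'Han'      if $ch =~ /\p{Han}/;
    return 'Hiragana' if $ch =~ /\p{Hiragana}/;
    return 'Katakana' if $ch =~ /\p{Katakana}/;
    return 'Latin'    if $ch =~ /\p{Latin}/;
    return 'Other';
}

my @scripts = map { script_of($_) } split //, $text;

# The whole string is still a single \w+ run, although it mixes scripts.
my $all_word = $text =~ /\A\w+\z/ ? 1 : 0;

printf "U+%04X %s\n", ord(substr($text, $_, 1)), $scripts[$_]
    for 0 .. $#scripts;
print "one \\w+ run: ", ($all_word ? "yes" : "no"), "\n";
```

The point of the sketch is that \w says only "this is a word character"; the finer-grained properties such as \p{Han} or \p{Katakana} are what distinguish one kind of character from another.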

--Tom Phoenix
Stonehenge Perl Training

Mumia W.

2007-06-22, 7:59 am

On 06/21/2007 09:42 PM, Tom Allison wrote:
> OK, I sorted out what the deal is with charsets, Encode, utf8 and other
> goodies.
>
> Now I have something where I'm not sure exactly how it is supposed to
> operate.
>
> I have a string:
> =?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
>
> After I decode_base64 them and decode('iso-2022-jp', $text) them I
> can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one "word" without
> spaces in it.
> Um... I don't know Japanese. But I guess this string of spaghetti (to
> me) is actually a language where one character, as displayed in a
> Unicode terminal, is actually one 'word' according to the Perl definition
> of a word...
>
> In English, this would pick apart words in a sense that is simple for me
> and many on this list to understand.
>
> I guess my question is, for CJK languages, should I expect the notion of
> using a regex like \w+ to pick up entire strings of text instead of
> discrete words, as in Latin-based languages?
>


Sadly, I must admit that I'm operating way outside of my knowledge
domain on this one, but I'll try to give an answer.

Yes, be prepared for the fact that not all foreign languages will
support the concept of spaces between words. I don't know anything about
Japanese, but I do vaguely remember from high school that, for Chinese
texts, there are often no spaces between words and the reader's
knowledge of the language allows him or her to infer the word separations.

However, even without knowing Japanese, we might be able to help you
find acceptable solutions. What is your program supposed to do?

Dr.Ruud

2007-06-22, 7:59 am

Tom Allison wrote:

> I have a string:
> =?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
>
> After I decode_base64 them and decode('iso-2022-jp', $text) them
> I can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one "word"
> without spaces in it.


http://www.patentstorm.us/patents/5...escription.html
(look for JLE)

So maybe if you convert to EUC, then insert spaces as the text suggests,
then convert back to utf8, you might have a "better" string to work

--
Affijn, Ruud

"Gewoon is een tijger."

Bob McConnell

2007-06-22, 9:59 pm

> -----Original Message-----
> From: tom@tacocat.net [mailto:tom@tacocat.net]
> Sent: Friday, June 22, 2007 8:36 AM
> To: mumia.w.18.spam+nospam@earthlink.net; beginners@perl.org;
> Mumia W.; Beginners List
> Subject: Re: still working with utf8
>
> So the Chinese might have a sentence like:
>
> thequickbrownfoxjumpedoverthefence
>
> and it's up to you, the reader, to figure out where the spaces are?
>


It has been a while since I had to deal with Asian character sets, but
for Chinese and (I believe) Kanji (Japanese) each pictograph (character)
is a word, so no spaces are required. Katakana is the phonetic version
of Japanese, which may or may not have spaces between the words. I never
had to read them, only validate that the images in the service manuals
looked like what was being displayed or printed.

Bob McConnell