Home > Archive > PERL Beginners > June 2007 > still working with utf8
still working with utf8
| Tom Allison 2007-06-22, 7:59 am |
| OK, I sorted out what the deal is with charsets, Encode, utf8 and
other goodies.
Now I have something I'm just not sure exactly how it is supposed to
operate.
I have a string:
=?iso-2022-jp?B? Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8k
PyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.
After I decode_base64 them and decode('iso-2022-jp', $text) them
I can print out something that looks exactly like Japanese characters.
But you can't match /(\w+)/ on them. It's apparently one "word"
without spaces in it.
Um... I don't know Japanese. But I guess this string of spaghetti
(to me) is actually a language where one character as represented in
a Unicode terminal is actually one 'word' according to the Perl
definition of a word...
In English, this would pick apart words in a sense that is simple for
me and many on this list to understand.
I guess my question is, for CJK languages, should I expect the notion
of using a regex like \w+ to pick up entire strings of text instead
of discrete words like Latin-based languages?
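For anyone who wants to reproduce this, here is a minimal sketch of the two decoding steps followed by the \w+ match. It assumes the space and line break in the header above are artifacts of wrapping, so the Base64 payload is joined into one string; the variable names are illustrative only.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# Base64 payload of the encoded-word above, joined into one string
# (assumption: the space and line break are wrapping artifacts).
my $payload = 'Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC';

# Step 1: Base64 -> raw ISO-2022-JP octets.
my $octets = decode_base64($payload);

# Step 2: octets -> Perl character string. Encode's decode() takes the
# encoding name first and the octets second.
my $text = decode('iso-2022-jp', $octets);

# Collect every run of word characters.
my @words = $text =~ /(\w+)/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters, %d runs of \\w+\n", length $text, scalar @words;
print "run: $_\n" for @words;
```

The ASCII prefix comes out as its own run, and everything after it comes out as a single run, which is the behaviour described above.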
| |
| Tom Phoenix 2007-06-22, 7:59 am |
| On 6/21/07, Tom Allison <tom@tacocat.net> wrote:
> I guess my question is, for CJK languages, should I expect the notion
> of using a regex like \w+ to pick up entire strings of text instead
> of discrete words like Latin-based languages?
Once you've enabled what the perlunicode manpage calls "Character
Semantics", it says:
Character classes in regular expressions match characters instead
of bytes and match against the character properties specified in
the Unicode properties database. "\w" can be used to match a
Japanese ideograph, for instance.
http://perldoc.perl.org/perlunicode.html
Does that manpage get you any closer to a solution? Hope this helps!
--Tom Phoenix
Stonehenge Perl Training
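A small sketch of what character semantics means in practice: the same three ideographs are matched once as a decoded character string and once as raw UTF-8 octets. The sample text and variable names are made up for illustration and are not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three ideographs as a character string (U+65E5 U+672C U+8A9E).
my $chars = "\x{65E5}\x{672C}\x{8A9E}";

# The same text as undecoded UTF-8 octets (a byte string).
my $octets = encode('UTF-8', $chars);

# On characters, \w consults the Unicode properties database.
my $char_match = ($chars  =~ /^\w+$/) ? 1 : 0;

# On raw bytes, the high octets are not word characters.
my $byte_match = ($octets =~ /^\w+$/) ? 1 : 0;

printf "characters: length %d, all \\w? %d\n", length $chars,  $char_match;
printf "octets:     length %d, all \\w? %d\n", length $octets, $byte_match;
```

The point is that \w only sees Japanese text as word characters once the input has been decoded into a character string.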
| |
| Mumia W. 2007-06-22, 7:59 am |
| On 06/21/2007 09:42 PM, Tom Allison wrote:
> OK, I sorted out what the deal is with charsets, Encode, utf8 and other
> goodies.
>
> Now I have something I'm just not sure exactly how it is supposed to
> operate.
>
> I have a string:
> =?iso-2022-jp?B? Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8k
> PyQkGyhC?=
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
>
> After I decode_base64 them and decode('iso-2022-jp', $text) them I
> can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one "word" without
> spaces in it.
> Um... I don't know Japanese. But I guess this string of spaghetti (to
> me) is actually a language where one character as represented in a
> Unicode terminal is actually one 'word' according to the Perl definition
> of a word...
>
> In English, this would pick apart words in a sense that is simple for me
> and many on this list to understand.
>
> I guess my question is, for CJK languages, should I expect the notion of
> using a regex like \w+ to pick up entire strings of text instead of
> discrete words like Latin-based languages?
>
Sadly, I must admit that I'm operating way outside of my knowledge
domain on this one, but I'll try to give an answer.
Yes, be prepared for the fact that not all foreign languages will
support the concept of spaces between words. I don't know anything about
Japanese, but I do vaguely remember from high school that, for Chinese
texts, there are often no spaces between words and the reader's
knowledge of the language allows him or her to infer the word separations.
However, even without knowing Japanese, we might be able to help you
find acceptable solutions. What is your program supposed to do?
| |
| Dr.Ruud 2007-06-22, 7:59 am |
| Tom Allison schreef:
> I have a string:
> =?iso-2022-jp?B? Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8k
> PyQkGyhC?=
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
>
> After I decode_base64 them and decode('iso-2022-jp', $text) them
> I can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one "word"
> without spaces in it.
http://www.patentstorm.us/patents/5...escription.html
(look for JLE)
So maybe if you convert to EUC, then insert spaces as the text suggests,
then convert back to utf8, you might have a "better" string to work
with.
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Bob McConnell 2007-06-22, 9:59 pm |
| > -----Original Message-----
> From: tom@tacocat.net [mailto:tom@tacocat.net]
> Sent: Friday, June 22, 2007 8:36 AM
> To: mumia.w.18.spam+nospam@earthlink.net; beginners@perl.org;
> Mumia W.; Beginners List
> Subject: Re: still working with utf8
>
> So the Chinese might have a sentence like:
>
> thequickbrownfoxjumpedoverthefence
>
> and it's up to you, the reader, to figure out where the spaces are?
>
It has been a while since I had to deal with Asian character sets, but
for Chinese and (I believe) Kanji (Japanese) each pictograph (character)
is a word, so no spaces are required. Katakana is the phonetic version
of Japanese, which may or may not have spaces between the words. I never
had to read them, only validate that the images in the service manuals
looked like what was being displayed or printed.
Bob McConnell
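Since \w+ will not split such text, one rough option is to break it wherever the script changes, using Perl's Unicode script properties. The sketch below is only a heuristic and not real word segmentation (that generally needs a dictionary-based tool), and the sample string is an invented example rather than anything from the thread.

```perl
use strict;
use warnings;

# Illustrative sample: two Han characters, one Hiragana,
# one Han, one Hiragana ("Tokyo ni iku").
my $text = "\x{6771}\x{4EAC}\x{306B}\x{884C}\x{304F}";

# \w+ treats the whole string as a single run.
my @word_runs = $text =~ /(\w+)/g;

# Splitting on script boundaries gives smaller pieces instead.
my @script_runs = $text =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+)/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "\\w+ runs: %d, script runs: %d\n",
    scalar @word_runs, scalar @script_runs;
print "piece: $_\n" for @script_runs;
```

The pieces this produces are script runs, which only sometimes line up with actual words, so treat them as a starting point for tokenizing rather than a result.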
| |