Home > Archive > PERL Miscellaneous > February 2005 > How to decode this unicode-hex string
| Author |
How to decode this unicode-hex string
| * Tong * 2005-02-25, 3:59 pm |
| Hi,
When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.
I'm wondering how I can decode such strings and return the 8-bit character.
So far I've been looking into the following Perl modules' man pages and
tried each one of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the two representations is that \u82f1 represents
one 8-bit character, while in Perl it is represented as two U+00xx values.
I also played with Tcl's decodings, but wasn't successful. Please help.
Thanks a lot!
tong
--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
| |
| phaylon 2005-02-25, 3:59 pm |
| * Tong * wrote:
> I'm wondering how I can decode such strings and return the 8-bit
> character.
Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.
--
http://www.dunkelheit.at/
The eternal mistake of mankind is to set up an attainable ideal.
-- Aleister Crowley
| |
| * Tong * 2005-02-25, 3:59 pm |
| On Fri, 25 Feb 2005 17:42:09 +0100, phaylon wrote:
>
> Sometimes I think all some people read from this group before posting is
> the name. Look at the thread right before yours.
Can you at least specify the thread subject if you want to help? Did you
mean the thread "How to convert latin1 to utf8"? Did you see that I've tried the
Unicode::String (and much more) before the posting? After all, have you
read the two threads carefully and seen the giant difference between them?
--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
| |
| phaylon 2005-02-25, 3:59 pm |
| * Tong * wrote:
> Can you at least specify the thread subject if you want to help?
No, that's your job. My job is to code. But sometimes I take breaks. And,
I'm sorry if this is offensive to you, but I'm not willing to spend my
breaks doing someone else's work.
> Did you mean the thread "How to convert latin1 to utf8"?
Bingo.
> Did you see that I've tried the Unicode::String (and much more) before
> the posting?
Yeah. And I said there I would try out Encode, have you done that?
> After all, have you read the two threads carefully and seen the giant
> difference between them?
Nope. Clear it up for me.
--
http://www.dunkelheit.at/
That is not dead, which can eternal lie,
and with strange aeons even death may die.
-- H.P. Lovecraft
| |
| * Tong * 2005-02-26, 3:57 am |
| On Fri, 25 Feb 2005 11:30:37 -0500, * Tong * wrote:
> When I select from non-English web sites and paste into my emacs,
> sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
> "English" in Big5 encoding.
>
> I'm wondering how I can decode such strings and return the 8-bit character.
>
> So far I've been looking into the following Perl modules' man pages and
> tried each one of them: Unicode::UTF8simple, Unicode::String,
> Unicode::Lite. None of them seems to be able to do that. They handle
> unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
> difference between the two representations is that \u82f1 represents
> one 8-bit character, while in Perl it is represented as two U+00xx values.
>
> I also played with Tcl's decodings, but wasn't successful. Please help.
Hi,
As per the suggestion from phaylon, I gave 'Encode' a try. Maybe I've
missed a very important part, but I still can't decode the unicode string
like \u82f1\u6587, using any of Encode, Unicode::UTF8simple,
Unicode::String, or Unicode::Lite.
More reading revealed that the "\u82f1\u6587" format is the default form
Java uses for Unicode escapes. Maybe I should use Java, but I don't want
to if this problem can be solved in Perl.
Thanks for your help!
--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
| |
| RedGrittyBrick 2005-02-26, 3:57 am |
| * Tong * wrote:
> Hi,
>
> When I select from non-English web sites and paste into my emacs,
> sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
> "English" in Big5 encoding.
I'm confused. Unicode and Big5 are completely different, aren't they? For
one thing, Unicode is a character set, for which there are several
encodings such as UTF-8.
u82f1 and u6587 are Chinese characters in Unicode. They are within the
CJK Unified Ideographs block, 4E00-9FAF.
http://www.unicode.org/charts/PDF/U4E00.pdf
Together they form the Chinese word whose English translation is the
word "English".
> I'm wondering how I can decode such strings and return the 8-bit character.
An 8-bit character set would surely not be large enough to contain a
usable subset of the Chinese ideographs. Big5 has roughly 13,000
ideographs. An 8-bit character set has room for 256 at most.
When you say "the 8 bit character" are you thinking of something like
the ISO 8859-1 Latin-1 character set?
Without a Chinese-English dictionary, there's no way to "decode" the two
Chinese ideograms u82f1 u6587 into the seven English letters u0045 u006e
u0067 u006c u0069 u0073 u0068.
> So far I've been looking into the following Perl modules' man pages and
> tried each one of them: Unicode::UTF8simple, Unicode::String,
> Unicode::Lite. None of them seems to be able to do that. They handle
> unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
> difference between the two representations is that
> \u82f1 represents one 8-bit character,
No it doesn't!
> while in Perl it is represented as two U+00xx values.
Two U+00xx values represent *TWO* Latin-1 characters.
| |
| * Tong * 2005-02-26, 3:57 am |
| Thanks for the reply.
On Fri, 25 Feb 2005 21:03:15 +0000, RedGrittyBrick wrote:
>
> No it doesn't!
>
> while in Perl it is represented in two U+00xx values.
>
> Two U+00xx values represent *TWO* Latin-1 characters.
Yeah, I stated that wrongly. It should read:
\u82f1 represents one Chinese character, which is in two 8-bit
characters
Anyway, I figured out a way to do it without any of the aforementioned
Unicode packages.
Thanks for clearing things up.
--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
| |
| Alan J. Flavell 2005-02-26, 3:57 am |
| On Fri, 25 Feb 2005, * Tong * wrote:
> \u82f1 represents one Chinese character,
Yes
> which is in two 8-bit characters
No way. As written, it's six *characters*. Encoded, it might be
two *bytes* (depends on the encoding).
> Anyway, I figured out a way to do it without any of the
> aforementioned Unicode packages.
But you're not going to tell us what it is?
| |
| * Tong * 2005-02-27, 8:57 pm |
| On Fri, 25 Feb 2005 21:42:38 +0000, Alan J. Flavell wrote:
>
> But you're not going to tell us what it is?
Well, it actually has nothing to do with unicode. Here is what I did to
decode such strings:
perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;
--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
| |
| Alan J. Flavell 2005-02-28, 3:57 am |
| On Sun, 27 Feb 2005, * Tong * wrote:
>
> Well, it actually has nothing to do with unicode.
Actually, it has a great deal to do with Unicode...
> Here is what I did to decode such strings:
>
> perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;
Fine. chr(hex($1)) is the Unicode character in question - in Perl's
native representation.
Thanks. It just goes to show how seamless Perl's Unicode
implementation is, when one can use it without even believing in it
;-)
Perhaps our questioner on another thread, who's determined to prevent
Perl's unicode from working for him, could take a lesson from this.
all the best