| Author |
How do I parse this character? 2-byte vs 1-byte
|
|
|
| I found the bug, but could not fix it. It seems related to how each OS interprets the bytes.
Under DOS, it looks like this
valign="top">ив</td>
In Unix, it looks like this
valign="top">| </td>
In my Textpad (in Chinese OS), it looks like this
valign="top">?/td>
I asked some experts about this. They told me that the character is mistakenly
combined with the next character to form a single character.
How do I make sure my Perl code parses this correctly? Can Perl tell a 2-byte
character from a 1-byte character?
| |
| Karel Kubat 2004-10-27, 3:57 pm |
| Hi,
> I found the bug, but could not fix it. It seems related to how each OS interprets the bytes.
>
> Under DOS, it looks like this
> valign="top">ив</td>
> In Unix, it looks like this
> valign="top">| </td>
> In my Textpad (in Chinese OS), it looks like this
> valign="top">?/td>
>
> I asked some experts about this. They told me that the character is mistakenly
> combined with the next character to form a single character.
> How do I make sure my Perl code parses this correctly? Can Perl tell a 2-byte
> character from a 1-byte character?
This is not a Perl issue per se, nor an OS-related one. You're
dealing with multibyte encodings of characters.
You need to look at the encoding of the original document first. Off the top
of my head, in an XML document, it would say something like <?xml
version="1.0" encoding="....."?>. When the encoding specifier is missing,
then UTF-8 is the default I think.
Your problem however probably refers to an HTML page, not to an XML
document. In that case the encoding might be in one of the HTTP headers
that are sent when a server outputs a page -- that will depend on the
server configuration.
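As a rough illustration of the point above, here is a minimal sketch that pulls the charset parameter out of a Content-Type value, whether it came from an HTTP header or from a META tag. The helper name charset_from_content_type and the sample value are made up for this example; real-world headers can be messier than this regex assumes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased charset from a Content-Type
# value, or nothing when no charset parameter is present.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    if ($value =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return lc $1;
    }
    return;
}

# Sample value (an assumption for this sketch).
my $header  = 'text/html; charset=ISO-8859-1';
my $charset = charset_from_content_type($header);

print defined $charset ? "charset: $charset\n" : "no charset declared\n";
```

When the function returns nothing, the charset has to be found some other way, for example from the document itself.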
And regarding encodings or character sets: _yes_, Perl can be told to
regard 2-byte sequences as 1 character (or even more than 2 bytes,
actually). Try "perldoc -f multibyte" and then play around with the Unicode
modules.
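To make the "2 bytes, 1 character" idea concrete, here is a small sketch using the core Encode module. It assumes the input is UTF-8, which may not match your data; the sample bytes are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that form one character in UTF-8: U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Before decoding, Perl sees two separate one-byte characters.
my $byte_count = length($bytes);

# After decoding, Perl sees a single character.
my $text       = decode('UTF-8', $bytes);
my $char_count = length($text);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $byte_count, $char_count, ord($text);
# prints "bytes: 2, characters: 1, code point: U+00E9"
```

The same call works for other encodings that Encode knows about; only the first argument changes.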
What _is_ the problem you're describing anyway? It might be helpful to
know..
Cheers,
--
Karel Kubat <karel@e-tunity.com, karel@qbat.org>
Phone: mobile (+31) 6 2956 4861, office (+31) (0)38 46 06 125
PGP fingerprint: D76E 86EC B457 627A 0A87 0B8D DB71 6BCD 1CF2 6CD5
From the Science Exam Papers:
Vegetative propagation is the process by which
one individual manufactures another individual
by accident.
| |
| Bob Walton 2004-10-27, 3:57 pm |
| nntp wrote:
> I found the bug, but could not fix it. It seems related to how each OS interprets the bytes.
>
> Under DOS, it looks like this
> valign="top">ив</td>
> In Unix, it looks like this
> valign="top">| </td>
> In my Textpad (in Chinese OS), it looks like this
> valign="top">?/td>
>
> I asked some experts about this. They told me that the character is mistakenly
> combined with the next character to form a single character.
>
> How do I make sure my Perl code parses this correctly? Can Perl tell a 2-byte
> character from a 1-byte character?
>
>
Well, you'll need to:
1. Get a recent version of Perl if you don't already have it (5.8.4 is
fine).
2. Check out the docs for the binmode() function: perldoc -f binmode
3. Determine what sort of encoding is used to represent your character.
If you don't know, you can guess by trying the options available in
the binmode() function. Chances are good it is UTF-8 encoding.
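The steps above can be sketched roughly as follows. This is only an illustration: it writes a throwaway temporary file so that it is self-contained, and it assumes UTF-8, which is a guess you would replace with whatever encoding your data really uses.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three raw bytes: 'a', then U+00E9 encoded as UTF-8 (2 bytes).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # raw bytes, no translation
print {$out} "a\xC3\xA9";
close $out or die "close failed: $!";

# Read the file back as raw bytes.
open my $raw, '<:raw', $path or die "cannot open $path: $!";
my $as_bytes = do { local $/; <$raw> };
close $raw;

# Read it again, asking Perl to decode UTF-8 as it reads.
open my $in, '<', $path or die "cannot open $path: $!";
binmode $in, ':encoding(UTF-8)';
my $as_text = do { local $/; <$in> };
close $in;

printf "raw length: %d, decoded length: %d\n",
    length($as_bytes), length($as_text);
# prints "raw length: 3, decoded length: 2"
```

Once the handle has the right layer, length(), substr() and regexes all work on characters instead of bytes.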
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
| |
|
|
>
> Well, you'll need to:
>
> 1. Get a recent version of Perl if you don't already have it (5.8.4 is
> fine).
>
> 2. Check out the docs for the binmode() function: perldoc -f binmode
>
> 3. Determine what sort of encoding is used to represent your character.
> If you don't know, you can guess by trying the options available in
> the binmode() function. Chances are good it is UTF-8 encoding.
>
I only need English characters. Is it possible to use s///gs to remove
those suckers? They have totally messed up my program. When I parse, I get
Chinese, French, Spanish, everything, but I only need English.
| |
| Ben Morrow 2004-10-27, 8:56 pm |
|
Quoth karel@e-tunity.com:
> And regarding encodings or character sets: _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.
You mean '-q'. :)
Ben
--
Although few may originate a policy, we are all able to judge it.
- Pericles of Athens, c.430 B.C.
ben@morrow.me.uk
| |
| Bob Walton 2004-10-27, 8:56 pm |
| nntp wrote:
....
> I only need English characters. Is it possible to use s///gs to remove
> those suckers? They have totally messed up my program. When I parse, I get
> Chinese, French, Spanish, everything, but I only need English.
Well, in order to process it with regexen, you would need to know what
encoding it is, and some detail about the encoding in order to properly
trash the correct number of bytes associated with each character. If
the encoding is, for example, UTF-8, it could be that some characters
may take three bytes, or even more. You would have to parse out the
encoding to know how many bytes to discard. It would be a *lot*
easier to just do the right thing and let Perl automatically handle it.
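A minimal sketch of "letting Perl handle it": decode the bytes first, then strip by character rather than by byte. It assumes the input is UTF-8, and the sample string is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample input: "caf" + U+00E9 + " ok", with the accented letter stored
# as two UTF-8 bytes.
my $bytes = "caf\xC3\xA9 ok";

# Decode first, so each multibyte sequence becomes one character.
my $text = decode('UTF-8', $bytes);

# Then drop every character outside the 7-bit ASCII range.
(my $ascii_only = $text) =~ s/[^\x00-\x7F]//g;

print "$ascii_only\n";    # prints "caf ok"
```

Decoding first matters most for double-byte encodings such as Big5 or Shift_JIS, where the second byte of a character can fall in the ASCII range, so a byte-level filter would leave stray characters behind.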
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
| |
|
|
"Karel Kubat" <karel@e-tunity.com> wrote in message
news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...
> Hi,
>
>
> This is not a Perl issue per se, nor an OS-related one. You're
> dealing with multibyte encodings of characters.
>
> You need to look at the encoding of the original document first. Off the top
> of my head, in an XML document, it would say something like <?xml
> version="1.0" encoding="....."?>. When the encoding specifier is missing,
> then UTF-8 is the default I think.
>
> Your problem however probably refers to an HTML page, not to an XML
> document. In that case the encoding might be in one of the HTTP headers
> that are sent when a server outputs a page -- that will depend on the
> server configuration.
>
> And regarding encodings or character sets: _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.
>
> What _is_ the problem you're describing anyway? It might be helpful to
> know..
>
> Cheers,
The first several lines:
<HTML XMLNS:IE>
<head>
<mainD5>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
URL of the page is www.ebay.com
I see charset=ISO-8859-1. Isn't that regular 1 byte encoding?
Can I do
s/\W//gs or s/[^\w]//gs
to remove everything that is not an English character, a digit, or one of
< _ . - ! / \?
I read perldoc -q multibyte.
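For what it is worth, here is a quick sketch of what such substitutions would do to a line like the ones above. The sample string and the two stray high-bit bytes are invented for the example, and the second character class is only one possible choice of "what to keep".

```perl
use strict;
use warnings;

my $line = 'valign="top">Hello, world!</td>';

# \W matches every character that is not a letter, digit, or underscore,
# so markup, punctuation, and spaces all disappear.
(my $stripped = $line) =~ s/\W//g;

# A narrower class: remove only bytes outside printable ASCII (plus tab
# and newline). Two stray high-bit bytes are appended to simulate the
# problem; the markup itself is left intact.
(my $ascii = "$line\xE8\xE2") =~ s/[^\x20-\x7E\t\n]//g;

print "$stripped\n";    # prints "valigntopHelloworldtd"
print "$ascii\n";       # prints the original line unchanged
```

Note that [^\w] is just another way of writing \W, and the /s modifier changes nothing here because neither pattern uses a dot.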
| |
| Bob Walton 2004-10-29, 3:57 am |
| nntp wrote:
> "Karel Kubat" <karel@e-tunity.com> wrote in message
> news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...
>
....
> The first several lines:
> <HTML XMLNS:IE>
> <head>
> <mainD5>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
> URL of the page is www.ebay.com
When I download http://www.ebay.com, I find there are no characters
present which appear to be encoded in any fashion whatever, nor any
which are not in the ASCII character set (and thus in ISO-8859-1, in the
half of it which does not have the high-order bit set). Are you certain
you are downloading and handling the data in the page correctly?
> I see charset=ISO-8859-1. Isn't that regular 1 byte encoding?
I'm not sure what you mean by "regular 1 byte encoding". If you are asking
the question "is ISO-8859-1 a 1-byte encoding", then I guess the answer
is yes, in the sense that matters here: ISO-8859-1 is a single-byte
character set, not a multibyte encoding. Each character in it is one byte
(8 bits, that is) long. In ISO-8859-1, the high-order bit is set for about
half of the characters. If those are displayed using software which
assumes a character set other than ISO-8859-1, the results may appear to
be garbled. Therefore, if you wish to view characters using the
ISO-8859-1 character set, use a viewer that uses the ISO-8859-1
character set. If you want more detail, you might have better luck in a
newsgroup dealing with character sets -- this newsgroup deals with Perl.
Or just look it up yourself. Good references are found in the first
page of results from Google.
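A small sketch of why the same byte can look different from one viewer to the next, using the core Encode module. ISO-8859-5 (Cyrillic) is picked here only as an example of "some other character set"; it is not necessarily what your DOS window or editor uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One byte with the high-order bit set.
my $byte = "\xE9";

# Read as ISO-8859-1 it is U+00E9 (e with acute accent) ...
my $latin1 = decode('ISO-8859-1', $byte);

# ... read as ISO-8859-5 the same byte maps to a different character.
my $cyrillic = decode('ISO-8859-5', $byte);

printf "ISO-8859-1: U+%04X, ISO-8859-5: U+%04X\n",
    ord($latin1), ord($cyrillic);

# Each is still a single character, but re-encoding the Latin-1
# character as UTF-8 takes two bytes.
my $utf8_length = length(encode('UTF-8', $latin1));
printf "UTF-8 length: %d\n", $utf8_length;
```

The byte never changes; only the table used to interpret it does, which is why the declared charset has to match what the viewer or the program assumes.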
....
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl