Code Comments
I found the bug, and could not fix it. It is related to OS-specific bytes.

Under DOS, it looks like this:
valign="top">ив</td>
In Unix, it looks like this:
valign="top">| </td>
In my TextPad (on a Chinese OS), it looks like this:
valign="top">?/td>

I asked experts about this. They told me that the character is mistakenly combined with the next character to become one character.

How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte word from a 1-byte word?

Hi,

> I found the bug, and could not fix it. It is related to OS-specific bytes.
>
> Under DOS, it looks like this:
> valign="top">ив</td>
> In Unix, it looks like this:
> valign="top">| </td>
> In my TextPad (on a Chinese OS), it looks like this:
> valign="top">?/td>
>
> I asked experts about this. They told me that the character is mistakenly
> combined with the next character to become one character.
> How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
> word from a 1-byte word?

This is not a Perl issue per se, nor is it an OS-related issue. You're dealing with multibyte encodings of characters.

You need to look at the encoding of the original document first. Off the top of my head, in an XML document, it would say something like <?xml version="1.0" encoding="....."?>. When the encoding specifier is missing, then UTF-8 is the default, I think.

Your problem, however, probably refers to an HTML page, not to an XML document. In that case the encoding might be in one of the HTTP headers that are sent when a server outputs a page -- that will depend on the server configuration.

And regarding encodings or character sets: _yes_, Perl can be told to regard 2-byte sequences as 1 character (or even more than 2 bytes, actually). Try "perldoc -f multibyte" and then play around with the Unicode modules.

What _is_ the problem you're describing anyway? It might be helpful to know.

Cheers,
-- 
Karel Kubat <karel@e-tunity.com, karel@qbat.org>
Phone: mobile (+31) 6 2956 4861, office (+31) (0)38 46 06 125
PGP fingerprint: D76E 86EC B457 627A 0A87 0B8D DB71 6BCD 1CF2 6CD5
From the Science Exam Papers: Vegetative propagation is the process by which one individual manufactures another individual by accident.

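To illustrate the point that Perl can be told to treat several bytes as one character, here is a minimal sketch using the core Encode module. It assumes the input is UTF-8, which is only a guess at this point in the thread; the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw bytes that together encode the single character U+00E9 in UTF-8.
# (Sample data chosen for illustration; not taken from the original page.)
my $bytes = "\xC3\xA9";

# decode() turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# Before decoding Perl counts bytes; after decoding it counts characters.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

If the real document uses a different encoding, the name passed to decode() has to change accordingly.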
nntp wrote:

> I found the bug, and could not fix it. It is related to OS-specific bytes.
>
> Under DOS, it looks like this:
> valign="top">ив</td>
> In Unix, it looks like this:
> valign="top">| </td>
> In my TextPad (on a Chinese OS), it looks like this:
> valign="top">?/td>
>
> I asked experts about this. They told me that the character is mistakenly
> combined with the next character to become one character.
>
> How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
> word from a 1-byte word?

Well, you'll need to:

1. Get a recent version of Perl if you don't already have it (5.8.4 is fine).

2. Check out the docs for the binmode() function: perldoc -f binmode

3. Determine what sort of encoding is used to represent your character. If you don't know, you can guess by trying the options available in the binmode() function. Chances are good it is UTF-8 encoding.

-- 
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

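The steps above can be sketched roughly as follows. This is an illustration only: it assumes the guess of UTF-8 is right, and it writes its own small temporary file (hypothetical sample data) so that it runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file holding the UTF-8 bytes for "caf" + U+00E9.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                        # write raw bytes, no translation
print {$out} "caf\xC3\xA9\n";
close($out) or die "close failed: $!";

# Reopen the file and tell Perl how its bytes are encoded.
open(my $in, '<', $path) or die "Cannot open $path: $!";
binmode($in, ':encoding(UTF-8)');     # assumed encoding; change it if the guess is wrong
my $line = <$in>;
close($in);
chomp $line;

# The two-byte sequence is now a single character.
printf "%d characters\n", length($line);
```

If the characters still come out wrong, the same code with a different layer name, for example ':encoding(ISO-8859-1)', is the next thing to try.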
> Well, you'll need to:
>
> 1. Get a recent version of Perl if you don't already have it (5.8.4 is
> fine).
>
> 2. Check out the docs for the binmode() function: perldoc -f binmode
>
> 3. Determine what sort of encoding is used to represent your character.
> If you don't know, you can guess by trying the options available in
> the binmode() function. Chances are good it is UTF-8 encoding.

I only need English characters. Is it possible to use s///gs to remove those suckers? It has totally messed up my program. When I parse, I get Chinese, French, Spanish, everything, but I only need English.

Quoth karel@e-tunity.com:

> And regarding encodings or character sets: _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.

You mean '-q'. :)

Ben

-- 
Although few may originate a policy, we are all able to judge it.
- Pericles of Athens, c.430 B.C.
ben@morrow.me.uk

nntp wrote:
...
> I only need English characters. Is it possible to use s///gs to remove
> those suckers? It has totally messed up my program. When I parse, I get
> Chinese, French, Spanish, everything, but I only need English.

Well, in order to process it with regexen, you would need to know what encoding it is, and some detail about the encoding, in order to properly trash the correct number of bytes associated with each character. If the encoding is, for example, UTF-8, some characters may take three bytes, or even more. You would have to parse out the encoding to know how many bytes to discard. It would be a *lot* easier to just do the right thing and let Perl handle it automatically.

-- 
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

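One way to "let Perl handle it" is to decode first and filter afterwards, so that the substitution removes whole characters rather than individual bytes. The sketch below assumes UTF-8 input and uses a made-up fragment containing one three-byte character; neither comes from the actual page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: markup with one three-byte UTF-8 character (U+4E2D) inside.
my $bytes = qq{valign="top">\xE4\xB8\xAD</td>};

# Decode the bytes into characters; the three bytes become one character.
my $text = decode('UTF-8', $bytes);

# Remove every character outside the ASCII range in a single pass.
(my $ascii = $text) =~ s/[^\x00-\x7F]//g;

print "$ascii\n";
```

Because the filter runs on the decoded string, there is no need to work out by hand how many bytes each unwanted character occupies.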
"Karel Kubat" <karel@e-tunity.com> wrote in message news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...

> Hi,
>
> This is not a Perl issue per se, and neither an OS-related issue. You're
> dealing with multibyte encodings of characters.
>
> You need to look at the encoding of the original document first. Off the top
> of my head, in an XML document, it would say something like <?xml
> version="1.0" encoding="....."?>. When the encoding specifier is missing,
> then UTF-8 is the default I think.
>
> Your problem however probably refers to an HTML page, not to an XML
> document. In that case the encoding might be in one of the HTTP headers
> that are sent when a server outputs a page -- that will depend on the
> server configuration.
>
> And regarding encodings or character sets: _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.
>
> What _is_ the problem you're describing anyway? It might be helpful to
> know..
>
> Cheers,

The first several lines:

<HTML XMLNS:IE>
<head>
<mainD5>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

The URL of the page is www.ebay.com

I see charset=ISO-8859-1. Isn't that a regular 1-byte encoding?

Can I do s/\W//gs or s/[^\w]//gs to remove everything that is not an English character or number or < _ . -! /\?

I read perldoc -q multibyte.

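As a rough check of the question above: \W matches every non-word character, so it would also strip the punctuation and markup the poster wants to keep. The sketch below compares it with an explicit character class; the sample fragment and the choice of "printable ASCII plus newline" as the set to keep are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample fragment: ASCII markup with one stray high-bit byte (0xA0) before </td>.
my $html = qq{valign="top">\xA0</td>};

# \W removes every non-word character, so the markup is destroyed as well.
(my $words_only = $html) =~ s/\W//g;

# An explicit class keeps printable ASCII (space through tilde) plus newlines.
(my $ascii_only = $html) =~ s/[^\x20-\x7E\n]//g;

print "$words_only\n";
print "$ascii_only\n";
```

The first result keeps only letters, digits, and underscores; the second drops just the high-bit byte and leaves the tags intact.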
nntp wrote:

> "Karel Kubat" <karel@e-tunity.com> wrote in message
> news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...
> ...
> The first several lines:
> <HTML XMLNS:IE>
> <head>
> <mainD5>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
> URL of the page is www.ebay.com

When I download http://www.ebay.com, I find no characters which appear to be encoded in any fashion whatever, nor any which are outside the ASCII character set (and thus in ISO-8859-1, in the half of it which does not have the high-order bit set). Are you certain you are downloading and handling the data in the page correctly?

> I see charset=ISO-8859-1. Isn't that regular 1 byte encoding?

I'm not sure what you mean by "regular 1 byte encoding". If you are asking the question "is ISO-8859-1 a 1-byte encoding", then I guess the answer is no, ISO-8859-1 isn't an encoding at all. It's just a character set. Each character in it is one byte (8 bits, that is) long. In ISO-8859-1, the high-order bit is used for some of the characters (well, about half of them). If those are displayed using software which uses a character set other than ISO-8859-1, the results may appear to be garbled. Therefore, if you wish to view characters using the ISO-8859-1 character set, use a viewer that uses the ISO-8859-1 character set.

If you want more detail, you might have better luck in a newsgroup dealing with character sets -- this newsgroup deals with Perl. Or just look it up yourself. Good references are found in the first page of results from Google.

...

-- 
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

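The garbling described above can be reproduced with a short sketch. It takes one byte with the high-order bit set and asks the core Encode module what character that byte stands for under three different single-byte character sets. The byte value and the two code pages other than ISO-8859-1 are arbitrary choices for illustration; the thread does not say which code pages the poster's DOS or Unix tools actually used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte with the high-order bit set (arbitrary sample value).
my $byte = "\xE9";

# Interpret the same byte under several single-byte character sets.
my %code_point;
for my $charset ('ISO-8859-1', 'cp437', 'cp1251') {
    my $char = decode($charset, $byte);
    $code_point{$charset} = ord($char);
    printf "%-10s -> U+%04X\n", $charset, $code_point{$charset};
}
```

Each character set maps the byte to a different Unicode code point, which is why the same file can look different in a DOS window, a Unix terminal, and an editor on a Chinese system.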