Code Comments


How do I parse this character? 2-byte vs 1-byte
I found the bug, but could not fix it. It is related to how each OS handles bytes.

Under DOS, it looks like this:
valign="top">ив</td>
In Unix, it looks like this:
valign="top">| </td>
In my TextPad (on a Chinese OS), it looks like this:
valign="top">?/td>

I asked experts about this. They told me that the character is mistakenly
combined with the next character to become one character.

How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
character from a 1-byte character?



nntp
10-27-04 08:57 PM


Re: How do I parse this character? 2-byte vs 1-byte
Hi,

> I found the bug, and could not fix it. It is related to OS related bytes.
>
> Under dos, it looks like this
> valign="top">ив</td>
> In Unix, it looks like this
> valign="top">| </td>
> In my Textpad (in Chinese OS), it looks like this
> valign="top">?/td>
>
> I asked experts about this. They told me that the character is mistakenly
> combined with the next character to become one character.
> How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
> character from a 1-byte character?

This is not a Perl issue per se, nor is it an OS-related issue. You're
dealing with multibyte encodings of characters.

You need to look at the encoding of the original document first. Off the top
of my head, in an XML document, it would say something like <?xml
version="1.0" encoding="....."?>. When the encoding specifier is missing,
the default is UTF-8 (or UTF-16 if the file starts with a byte order mark).

Your problem, however, probably concerns an HTML page, not an XML
document. In that case the encoding might be in one of the HTTP headers
(Content-Type) that are sent when a server outputs a page -- that will
depend on the server configuration.
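
A rough sketch of checking the declared encoding, assuming the Content-Type value (from an HTTP header or a META tag) is already available as a string. The helper name charset_of and the sample value are made up for illustration; a real page may declare no charset at all, or declare the wrong one.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns the charset name in lower case, or undef if none is declared.
sub charset_of {
    my ($content_type) = @_;
    return $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i ? lc($1) : undef;
}

# Sample value, assumed for this example only.
my $header  = 'text/html; charset=ISO-8859-1';
my $charset = charset_of($header);

print defined $charset ? "declared charset: $charset\n"
                       : "no charset declared\n";
```

Once the charset is known, it can be handed to Perl's decoding functions instead of guessing.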

And regarding encodings or character sets:  _yes_, Perl can be told to
regard 2-byte sequences as 1 character (or even more than 2 bytes,
actually). Try "perldoc -f multibyte" and then play around with the Unicode
modules.
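
A small illustration of the difference between bytes and characters, assuming the input is UTF-8 and using the Encode module that ships with Perl 5.8 and later. The sample bytes are invented for the example and are not taken from the page in question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that together encode one character in UTF-8 (U+00E9, e-acute).
my $bytes = "\xC3\xA9";

# Before decoding, Perl treats the string as two one-byte characters.
my $byte_count = length $bytes;

# After decoding, Perl treats the same data as a single character.
my $text       = decode('UTF-8', $bytes);
my $char_count = length $text;

print "bytes: $byte_count, characters: $char_count\n";
```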

What _is_ the problem you're describing anyway? It might be helpful to
know..

Cheers,
--
Karel Kubat <karel@e-tunity.com, karel@qbat.org>
Phone: mobile (+31) 6 2956 4861, office (+31) (0)38 46 06 125
PGP fingerprint: D76E 86EC B457 627A 0A87  0B8D DB71 6BCD 1CF2 6CD5

From the Science Exam Papers:
Vegetative propagation is the process by which
one individual manufactures another individual
by accident.


Karel Kubat
10-27-04 08:57 PM


Re: How do I parse this character? 2-byte vs 1-byte
nntp wrote:

> I found the bug, and could not fix it. It is related to OS related bytes.
>
> Under dos, it looks like this
> valign="top">ив</td>
> In Unix, it looks like this
> valign="top">| </td>
> In my Textpad (in Chinese OS), it looks like this
> valign="top">?/td>
>
> I asked experts about this. They told me that the character is mistakenly
> combined with the next character to become one character.
>
> How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
> character from a 1-byte character?
>
>

Well, you'll need to:

1.  Get a recent version of Perl if you don't already have it (5.8.4 is
    fine).

2.  Check out the docs for the binmode() function:  perldoc -f binmode

3.  Determine what sort of encoding is used to represent your character.
    If you don't know, you can guess by trying the options available in
    the binmode() function.  Chances are good it is UTF-8 encoding.
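
The steps above can be sketched as follows, assuming the data turns out to be UTF-8. The temporary file and its contents exist only to make the example self-contained; with real data you would open your own file and substitute whichever encoding name applies.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample file holding the raw UTF-8 bytes for "caf" plus e-acute.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # write raw bytes, no translation
print {$out} "caf\xC3\xA9\n";
close $out or die "close failed: $!";

# Read it back, telling Perl to decode UTF-8 bytes into characters.
open my $in, '<', $path or die "cannot open $path: $!";
binmode $in, ':encoding(UTF-8)';
my $line = <$in>;
close $in;
chomp $line;

# length() now counts characters rather than bytes.
print "characters read: ", length($line), "\n";
```

If the guess is wrong, the decoded text will look garbled or Perl will warn about malformed data, which is a hint to try a different encoding name in the binmode() call.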

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton
10-27-04 08:57 PM


Re: How do I parse this character? 2-byte vs 1-byte

>
> Well, you'll need to:
>
> 1.  Get a recent version of Perl if you don't already have it (5.8.4 is
> fine).
>
> 2.  Check out the docs for the binmode() function:  perldoc -f binmode
>
> 3.  Determine what sort of encoding is used to represent your character.
>   If you don't know, you can guess by trying the options available in
> the binmode() function.  Chances are good it is UTF-8 encoding.
>

I only need English characters. Is it possible to use s///gs to remove
those suckers? They have totally messed up my program. When I parse, I get
Chinese, French, Spanish, everything, but I only need English.



nntp
10-27-04 08:57 PM


Re: How do I parse this character? 2-byte vs 1-byte
Quoth karel@e-tunity.com:
> And regarding encodings or character sets:  _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.

You mean '-q'. :)

Ben

--
Although few may originate a policy, we are all able to judge it.
- Pericles of Athens, c.430 B.C.
ben@morrow.me.uk

Ben Morrow
10-28-04 01:56 AM


Re: How do I parse this character? 2-byte vs 1-byte
nntp wrote:

...

> I only need English characters. Is that possible using s///gs to remove
> those suckers? It is totally messed up my program. When I parse, I got
> Chinese, French, Spanish, everything, but I only need English.

Well, in order to process it with regexen, you would need to know what
encoding it is, and some detail about the encoding, in order to properly
trash the correct number of bytes associated with each character.  If
the encoding is, for example, UTF-8, some characters may take three
bytes, or even more.  You would have to parse out the encoding to know
how many bytes to discard.  It would be a *lot* easier to just do the
right thing and let Perl handle it automatically.
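
A minimal sketch of letting Perl do the work, assuming the input is UTF-8 (the sample bytes are invented for the example). Decoding first turns each multibyte sequence into one character, so a simple character class can then drop everything outside 7-bit ASCII without any byte counting.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample input: raw UTF-8 bytes containing a 2-byte and a 3-byte character.
my $bytes = "<td>price \xC3\xA9 \xE2\x82\xAC 10</td>";

# Decode first, so each multibyte sequence becomes a single character.
my $text = decode('UTF-8', $bytes);

# Then remove every character outside the 7-bit ASCII range.
(my $ascii = $text) =~ s/[^\x00-\x7F]//g;

print "$ascii\n";
```

Running the same substitution on the undecoded bytes happens to work for UTF-8, because every byte of a multibyte sequence has its high bit set, but that is not true of every multibyte encoding, so decoding first is the safer habit.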

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton
10-28-04 01:56 AM


Re: How do I parse this character? 2-byte vs 1-byte
"Karel Kubat" <karel@e-tunity.com> wrote in message
news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...
> Hi,
>
> This is not a Perl issue per se, and neither an OS-related issue. You're
> dealing with multibyte encodings of characters.
>
> You need to look at the encoding of the original document first. Off the top
> of my head, in an XML document, it would say something like <?xml
> version="1.0" encoding="....."?>. When the encoding specifier is missing,
> then UTF-8 is the default I think.
>
> Your problem however probably refers to an HTML page, not to an XML
> document. In that case the encoding might be in one of the HTTP headers
> that are sent when a server outputs a page -- that will depend on the
> server configuration.
>
> And regarding encodings or character sets:  _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.
>
> What _is_ the problem you're describing anyway? It might be helpful to
> know..
>
> Cheers,

The first several lines:
<HTML XMLNS:IE>
<head>
<mainD5>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
The URL of the page is www.ebay.com
I see charset=ISO-8859-1. Isn't that a regular 1-byte encoding?

Can I do
s/\W//gs  or s/[^\w]//gs
to remove everything that is not an English character, a number, or one of
< _ . - ! / \ ?

I read perldoc -q multibyte.
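
For comparison, a short sketch of what those substitutions would do. The sample string and its two high-bit bytes are made up, and the whitelist adds > " = to the characters listed above (an assumption, so that the tags survive). Note that /s only changes how . matches, so it has no effect on either pattern.

```perl
use strict;
use warnings;

# Sample markup with two hypothetical high-bit bytes between the tags.
my $html = "valign=\"top\">\xA8\xA2</td>";

# \W removes every non-word character, including the markup itself.
(my $words_only = $html) =~ s/\W//g;

# An explicit whitelist keeps ASCII letters, digits, whitespace and the
# chosen punctuation; everything else is removed.
(my $kept = $html) =~ s{[^A-Za-z0-9\s<>_.!/\\?"=-]}{}g;

print "$words_only\n";
print "$kept\n";
```

The first result loses the tag delimiters along with the unwanted bytes, while the second keeps the markup intact.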



nntp
10-28-04 08:56 AM


Re: How do I parse this character? 2-byte vs 1-byte
nntp wrote:

> "Karel Kubat" <karel@e-tunity.com> wrote in message
> news:417fdccf$0$142$e4fe514c@dreader19.news.xs4all.nl...
>
...
> The first several lines:
> <HTML XMLNS:IE>
> <head>
> <mainD5>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
> URL of the page is www.ebay.com

When I download http://www.ebay.com, I find there are no characters
present which appear to be encoded in any fashion whatever, nor any
which are not in the ASCII character set (and thus in ISO-8859-1, in the
half of it which does not have the high-order bit set).  Are you certain
you are downloading and handling the data in the page correctly?

> I see charset=ISO-8859-1. Isn't that regular 1 byte encoding?

I'm not sure what you mean by "regular 1 byte encoding".  If you are asking
the question "is ISO-8859-1 a 1-byte encoding", then the answer is yes in
practice: ISO-8859-1 names a character set in which each character is
stored as exactly one byte (8 bits, that is).  In ISO-8859-1, the
high-order bit is set for some of the characters (well, about half of
them).  If those are displayed using software which uses a character set
other than ISO-8859-1, the results may appear to be garbled.  Therefore,
if you wish to view characters in the ISO-8859-1 character set, use a
viewer that uses the ISO-8859-1 character set.  If you want more detail,
you might have better luck in a newsgroup dealing with character sets --
this newsgroup deals with Perl.  Or just look it up yourself.  Good
references are found in the first page of results from Google.
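
To illustrate the garbling described above, here is a small sketch that interprets the same two bytes under two different single-byte character sets. The byte values are hypothetical, chosen only for the example, and cp866 (a DOS Cyrillic code page) is just one of many possible alternatives; both names are supported by the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two hypothetical bytes with the high-order bit set.
my $bytes = "\xA8\xA2";

# Interpret the same bytes under two different character sets.
my $latin1 = decode('iso-8859-1', $bytes);
my $cp866  = decode('cp866',      $bytes);

# Print the resulting code points instead of the characters themselves,
# so the output does not depend on the terminal's own encoding.
printf "ISO-8859-1: U+%04X U+%04X\n", map { ord($_) } split //, $latin1;
printf "CP866:      U+%04X U+%04X\n", map { ord($_) } split //, $cp866;
```

The bytes never change; only the table used to interpret them does, which is why the same page can look different in each viewer.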

...

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton
10-29-04 08:57 AM


PERL Miscellaneous archive
