Code Comments

dealing unicode output
I'm at a loss...  I still need to do some learning about Unicode, but
basically I'm reading a Unicode UTF-16LE file and have successfully
done so, with one issue.  When I print the first line of input, the
BOM is still there.  I was thinking that the BOM would be stripped off
during the open-then-read process, but I was wrong.  The printed output
has three characters of garbage at the beginning of the string; the
rest of the output is fine.  I have tried to use a regex to remove
it prior to printing, without success.  I'll post the code below and
see if someone can spot my issue.  Thanks in advance!

#############code
#!/usr/bin/perl -w
#
use strict;
use warnings;
use Cwd;
use Encode;

my $file1_in = "uft-16 LE file.txt";
my $fh1;
my $line;
my $rc;

open $fh1, "<:raw:encoding(UTF16-LE):crlf:utf8", $file1_in
    or die "can't open $file1_in: $!";
binmode STDOUT, ":utf8";

foreach $line (<$fh1>) {
    chomp($line);
    print "$line\n";
}
close $fh1;

#########output
=A1=C9=A8[=A9=B42007-05-11 15:34:15.10 Server      Microsoft SQL Server 2005 - 9.00.2050.00 (X64)
...rest of the output is normal...


Tewilk@Gmail.Com
01-25-08 12:07 AM


Re: dealing unicode output
tewilk@gmail.com wrote:

> [...] I'm reading a Unicode UTF-16LE file and have successfully
> done so, with one issue.  When I print the first line of input, the
> BOM is still there...

By specifying the "le", you express that you already know the byte
order.
The U+FEFF is then read as the "zero-width no-break space", and not
as the BOM.

So either toss the "le" or toss the BOM character: s/^\x{FEFF}//;
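
A minimal sketch of both options, for illustration only. The sample file is a temporary file created by the script itself, and the sub names first_line_bom_aware and first_line_strip_bom are made up for this example. It assumes the Encode module that ships with Perl: the "UTF-16" encoding expects a BOM and consumes it, while "UTF-16LE" decodes the same two bytes as an ordinary U+FEFF character that has to be removed by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Create a small UTF-16LE sample file that starts with a BOM (bytes FF FE).
# The contents are invented for this sketch.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} encode('UTF-16LE', "\x{FEFF}hello\n");
close $tmp_fh or die "can't close $path: $!";

# Option 1: drop the "le". The UTF-16 decoder reads the BOM to find
# the byte order and does not pass it on as a character.
sub first_line_bom_aware {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-16)', $file
        or die "can't open $file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

# Option 2: keep UTF-16LE and strip the leading U+FEFF yourself.
sub first_line_strip_bom {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-16LE)', $file
        or die "can't open $file: $!";
    my $line = <$fh>;
    close $fh;
    $line =~ s/^\x{FEFF}//;    # code point U+FEFF, not the bytes FF FE
    chomp $line;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print first_line_bom_aware($path), "\n";
print first_line_strip_bom($path), "\n";
```

Option 1 only suits files that really do start with a BOM; for files that may or may not have one, stripping the character after decoding is the more forgiving choice.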

--
Affijn, Ruud

"Gewoon is een tijger."

Dr.Ruud
01-25-08 12:07 AM


Re: dealing unicode output
On Jan 24, 7:35 pm, rvtol+n...@isolution.nl (Dr.Ruud) wrote:
> tew...@gmail.com wrote:
> 
>
> By specifying the "le", you express that you already know the byte
> order.
> The U+FEFF is then read as the "zero-width no-break space", and not
> as the BOM.
>
> So either toss the "le" or toss the BOM character: s/^\x{FEFF}//;
>
> --
> Affijn, Ruud
>
> "Gewoon is een tijger."

Great! Both worked.  The thing I still don't understand is that in the
file the BOM is FFFE, not FEFF, so I had already tried
s/^\x{FFFE}//; with no success, but your suggestion of
s/^\x{FEFF}//; worked.  It is in reverse order for some reason.  Now I
need to read further into "zero-width no-break space"; I'm not sure I
understand why it is called that and not BOM.  Dealing with Unicode is
a bit over my head at the moment, so thanks very much for the fix to
what was a simple change.  Off to find more material to read on this
subject, thanks again!


Tewilk@Gmail.Com
01-26-08 12:06 AM


Re: dealing unicode output
On Jan 25, 2008 10:06 AM, tewilk@gmail.com <tewilk@gmail.com> wrote:
snip
> Great! both worked.  The thing I still don't understand is that in the
> file the BOM is FFFE not FEFF
snip

This is because the file is little-endian; if it were a big-endian
file, the bytes would be FE FF.  The character is the same, but the
order of the bytes changes depending on the endianness of the file.
The BOM isn't a flag that declares the byte order; it is a character
that is known in advance, so the order in which its two bytes appear
tells you the byte order of the file.
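
As a rough illustration of that idea, the sketch below looks at the first two raw bytes of a buffer and reports which byte order they imply. The sub name sniff_utf16_bom is invented for this example, and it deliberately ignores other encodings (a UTF-32LE BOM also begins with FF FE, for instance), so treat it as a teaching aid rather than a complete detector.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess the UTF-16 byte order from the raw bytes at the start of a buffer.
# Returns 'UTF-16LE', 'UTF-16BE', or nothing if no UTF-16 BOM is present.
sub sniff_utf16_bom {
    my ($bytes) = @_;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;    # U+FEFF, low byte first
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;    # U+FEFF, high byte first
    return;
}

# Hand-built byte strings standing in for the first bytes of a file.
for my $sample ("\xFF\xFEh\x00", "\xFE\xFF\x00h", "plain") {
    my $guess = sniff_utf16_bom($sample);
    print defined $guess ? $guess : 'no UTF-16 BOM', "\n";
}
```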

snip
> so I had already tried s/^\x{FFFE}//; with no success, but your
> suggestion of s/^\x{FEFF}//; worked.  It is in reverse order for
> some reason.
snip

Perl uses the Unicode character number (code point) for "\x{}", so
ZERO WIDTH NO-BREAK SPACE is "\x{FEFF}" even if it is written to the
file as the little-endian bytes FF FE.  Avoid confusing the encoding of
Unicode with Unicode itself.  For instance, the UTF-8 encoding of
"\x{FEFF}" is EF BB BF.
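
A short sketch that makes the distinction visible: one code point, three different byte sequences. It assumes the standard Encode module; the helper name hex_bytes is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a character string and return its bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

# The same character, U+FEFF, under three encodings.
for my $encoding ('UTF-16LE', 'UTF-16BE', 'UTF-8') {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, "\x{FEFF}");
}
# Prints:
# UTF-16LE FF FE
# UTF-16BE FE FF
# UTF-8    EF BB BF
```

This is why the substitution has to name \x{FEFF}: by the time the regex runs, the input layer has already turned the bytes FF FE into that single character.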

snip
> Now I need to read
> further into "zero-width no-break space", not sure that I understand
> why it is called that and not BOM.  Dealing with unicode at the moment
> is over my head a bit so thanks very much for the fix to what was a
> simple change.  Off to find more material to read about this subject
> matter, thanks again!
snip

from http://en.wikipedia.org/wiki/Byte_Order_Mark
In most character encodings the BOM is a pattern which
is unlikely to be seen in other contexts (it would usually
look like a sequence of obscure control codes). If a BOM
is misinterpreted as an actual character within Unicode
text then it will generally be invisible due to the fact it is a
zero-width no-break space. Use of the U+FEFF character
for non-BOM purposes has been deprecated in Unicode
3.2 (which provides an alternative, U+2060, for those
other purposes), allowing U+FEFF to be used solely with
the semantic of BOM.

Also, there is a nice chart here:
http://www.websina.com/bugzero/kb/unicode-bom.html

Chas. Owens
01-26-08 12:06 AM


Re: dealing unicode output

Thanks for the feedback...  I will look into the sites you sent for
additional information. Thanks!


Tewilk@Gmail.Com
01-26-08 12:06 AM

