I'm at a loss... I still need to learn more about Unicode, but
basically I'm reading a Unicode UTF-16LE file and have done so
successfully, with one issue. When I print the first line of input, the
BOM is still there... I thought the BOM would be stripped off
during the open-then-read process, but I was wrong. The printed output
has three characters of garbage at the beginning of the string; the
rest of the output is fine. I have tried using a regex to remove
it prior to printing, without success. I'll post the code below to
see if someone can spot my issue. Thanks in advance!
#############code
#!/usr/bin/perl -w
#
use strict;
use warnings;
use Cwd;
use Encode;
my $file1_in = "uft-16 LE file.txt";
my $fh1;
my $line;
my $rc;
open $fh1, "<:raw:encoding(UTF16-LE):crlf:utf8", $file1_in
    or die "can't open $file1_in: $!";
binmode STDOUT, ":utf8";
foreach $line (<$fh1>) {
    chomp($line);
    print "$line\n";
}
close $fh1;
#########output
∩╗┐2007-05-11 15:34:15.10 Server Microsoft SQL Server 2005 - 9.00.2050.00 (X64)
...rest of the output is normal...
tewilk@gmail.com wrote:
> [...] I'm reading an unicode utf-16le file and have successfully
> done so but with one issue. When I print the first line of input the
> BOM is still there...
By specifying the "le", you express that you already know the byte
order.
The U+FEFF is then read as the "zero-width no-break space", and not
as the BOM.
So either toss the "le" or toss the BOM character: s/^\x{FEFF}//;
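
A minimal sketch of the two options, assuming a small throwaway test file created with File::Temp. The sub names and file contents are made up for illustration and are not from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small UTF-16LE test file that starts with a BOM (bytes FF FE).
# The contents are invented for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16LE)';
print {$tmp_fh} "\x{FEFF}first line\nsecond line\n";
close $tmp_fh or die "can't close $path: $!";

# Option 1: drop the "LE". Plain UTF-16 expects a BOM, uses it to pick
# the byte order, and does not pass it on to the program.
sub read_with_bom_detection {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-16)', $file
        or die "can't open $file: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# Option 2: keep UTF-16LE and strip U+FEFF from the first line by hand.
sub read_and_strip_bom {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-16LE)', $file
        or die "can't open $file: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    $lines[0] =~ s/^\x{FEFF}// if @lines;
    return @lines;
}

my @detected = read_with_bom_detection($path);
my @stripped = read_and_strip_bom($path);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @detected;
```

Option 1 only works when the file really does start with a BOM; option 2 also copes with UTF-16LE files that have none.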
--
Affijn, Ruud
"Gewoon is een tijger."
On Jan 24, 7:35 pm, rvtol+n...@isolution.nl (Dr.Ruud) wrote:
> tew...@gmail.com wrote:
>
> By specifying the "le", you express that you already know the byte
> order.
> The U+FEFF is then read as the "zero-width no-break space", and not
> as the BOM.
>
> So either toss the "le" or toss the BOM character: s/^\x{FEFF}//;
>
> --
> Affijn, Ruud
>
> "Gewoon is een tijger."
Great! Both worked. The thing I still don't understand is that in the
file the BOM is FFFE, not FEFF, so I had already tried
s/^\x{FFFE}//; with no success, but your suggestion of
s/^\x{FEFF}//; worked; it is in reverse order for some reason. Now I need to read
further into "zero-width no-break space"; I'm not sure I understand
why it is called that and not BOM. Dealing with Unicode is a bit over my
head at the moment, so thanks very much for the fix to what was a
simple change. Off to find more material to read on this subject;
thanks again!
On Jan 25, 2008 10:06 AM, tewilk@gmail.com <tewilk@gmail.com> wrote:
snip
> Great! both worked. The thing I still don't understand is that in the
> file the BOM is FFFE not FEFF
snip
This is because the file is little-endian; if it were a big-endian file
the bytes would be FE FF. The character is the same, but the order of the
bytes changes depending on the endianness of the file. The BOM isn't a
marker that declares the file to be one endianness or the other; it is a
character that is known in advance and so lets you easily tell which byte
order the file uses.
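
As a rough sketch of that idea, assuming the data is already known to be UTF-16, you can look at the first two bytes to tell the byte order. The sub name utf16_byte_order is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: guess the UTF-16 byte order from the first two bytes.
# Assumes the data is UTF-16; other encodings are not considered here.
sub utf16_byte_order {
    my ($bytes) = @_;
    return 'little-endian' if $bytes =~ /\A\xFF\xFE/;
    return 'big-endian'    if $bytes =~ /\A\xFE\xFF/;
    return 'unknown (no BOM)';
}

# The same two code units, U+FEFF followed by "A", packed in each byte order.
my $le = pack 'v*', 0xFEFF, ord('A');   # bytes FF FE 41 00
my $be = pack 'n*', 0xFEFF, ord('A');   # bytes FE FF 00 41

print utf16_byte_order($le), "\n";
print utf16_byte_order($be), "\n";
print utf16_byte_order('2007'), "\n";
```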
snip
> so I have already tried to use s/^\x{FFFE}//; with no success but
> your feedback worked with the s/^\x{FEFF}//; it is in reverse order
> for some reason.
snip
Perl uses the Unicode code point for "\x{}", so ZERO WIDTH
NO-BREAK SPACE is "\x{FEFF}" even if it is written to the file as the
little-endian bytes FF FE. Avoid confusing an encoding of Unicode
with Unicode itself. For instance, the UTF-8 encoding of "\x{FEFF}"
is EF BB BF.
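
If you want to see this for yourself, here is a small sketch using the core Encode module. The hex_bytes helper is a made-up name for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a string and return its bytes as hex pairs.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# One character, U+FEFF, written out under three different encodings.
my $bom = "\x{FEFF}";
print "$_: ", hex_bytes($_, $bom), "\n" for qw(UTF-8 UTF-16LE UTF-16BE);
```

The character in $bom never changes; only the bytes produced by each encoding differ.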
snip
> Now I need to read
> further into "zero-width no-break space", not sure that I understand
> why it is called that and not BOM. Dealing with unicode at the moment
> is over my head a bit so thanks very much for the fix to what was a
> simple change. Off to find more material to read about this subject
> matter, thanks again!
snip
From http://en.wikipedia.org/wiki/Byte_Order_Mark:
In most character encodings the BOM is a pattern which
is unlikely to be seen in other contexts (it would usually
look like a sequence of obscure control codes). If a BOM
is misinterpreted as an actual character within Unicode
text then it will generally be invisible due to the fact it is a
zero-width no-break space. Use of the U+FEFF character
for non-BOM purposes has been deprecated in Unicode
3.2 (which provides an alternative, U+2060, for those
other purposes), allowing U+FEFF to be used solely with
the semantic of BOM.
Also, there is a nice chart here:
http://www.websina.com/bugzero/kb/unicode-bom.html
On Jan 25, 10:30 am, chas.ow...@gmail.com (Chas. Owens) wrote:
> On Jan 25, 2008 10:06 AM, tew...@gmail.com <tew...@gmail.com> wrote:
> snip
> > Great! both worked. The thing I still don't understand is that in the
> > file the BOM is FFFE not FEFF
>
> snip
>
> This is because it is little endian, if it were a big endian file it
> would be FEFF. The character is the same, but the order of the bytes
> change depending on the endian-ness of the file. The BOM isn't a
> marker that says the file is one endian or another, it is a character
> that is known in advance that lets you easily tell which endian the
> file is.
>
> snip
> > so I have already tried to use s/^\x{FFFE}//; [...]
>
> snip
>
> Perl uses the Unicode character number for "\x{}", so ZERO WIDTH
> NO-BREAK SPACE is "\x{FEFF}" even if it is written to the file in
> little-endian bytes FF FE. Avoid confusing the encoding of Unicode
> with Unicode itself. For instance, The UTF-8 encoding of "\x{FEFF}"
> is EF BB BF.
>
> snip
> > Now I need to read [...]
>
> snip
>
> from http://en.wikipedia.org/wiki/Byte_Order_Mark
> In most character encodings the BOM is a pattern which
> is unlikely to be seen in other contexts (it would usually
> look like a sequence of obscure control codes). If a BOM
> is misinterpreted as an actual character within Unicode
> text then it will generally be invisible due to the fact it is a
> zero-width no-break space. Use of the U+FEFF character
> for non-BOM purposes has been deprecated in Unicode
> 3.2 (which provides an alternative, U+2060, for those
> other purposes), allowing U+FEFF to be used solely with
> the semantic of BOM.
>
> Also, there is a nice chart here:
> http://www.websina.com/bugzero/kb/unicode-bom.html
Thanks for the feedback... I will look into the sites you sent for
additional information. Thanks!