handling a byte order mark (BOM) in input text
Andrew Schorr

2006-07-12, 6:56 pm

Hi,

Is there a clear sense of how files with a BOM at the start should be
handled?

For example, running this command:

printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
LC_ALL=en_US.UTF-8 gawk '/^he/'

gives this output:

hello
hello

So the first hello does not match because of the BOM at the start of
the file.
Is this the proper behavior, or should awk ignore the leading BOM?

Regards,
Andy
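
Whatever awk itself ought to do, one possible workaround is to drop a leading UTF-8 BOM before the pattern is applied. The following is only a sketch under stated assumptions, not gawk behavior: strip_bom is a hypothetical helper name, the BOM bytes are written as octal escapes (\357\273\277) so the printf also works in shells whose printf lacks \x support, and the filter runs under LC_ALL=C so the three bytes are compared as plain bytes.

```shell
# strip_bom: hypothetical helper, not part of awk or gawk.
# Removes a leading UTF-8 BOM (bytes EF BB BF) from the first line of stdin.
strip_bom() {
    LC_ALL=C awk '
        NR == 1 && index($0, "\357\273\277") == 1 { $0 = substr($0, 4) }
        { print }
    '
}

# Same input as in the question, with the BOM removed before matching.
printf '\357\273\277hello\nhello\nhello\n' | strip_bom | awk '/^he/'
```

With the BOM removed first, the anchored pattern should see all three lines. Input without a BOM passes through unchanged.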

Xicheng Jia

2006-07-12, 6:56 pm

Andrew Schorr wrote:
> Hi,
>
> Is there a clear sense of how files with a BOM at the start should be
> handled?
>
> For example, running this command:
>
> printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
> LC_ALL=en_US.UTF-8 gawk '/^he/'
>
> gives this output:
>
> hello
> hello
>
> So the first hello does not match because of the BOM at the start of
> the file.
> Is this the proper behavior, or should awk ignore the leading BOM?


^ is an anchor that matches the start of a line, so /^he/ only matches
'he' located at the beginning of a line. There are some other
characters before your first hello, so it is not a match. To match all
three cases, use /he/ instead of /^he/.



Xicheng
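
The difference between the anchored and the unanchored pattern can be checked with a small sketch. count_matches is a hypothetical helper name, and LC_ALL=C is assumed so that the BOM bytes are treated as ordinary data regardless of the awk implementation.

```shell
# count_matches: hypothetical helper that counts the lines of the sample
# input matching the regular expression given as its first argument.
count_matches() {
    printf '\357\273\277hello\nhello\nhello\n' |
        LC_ALL=C awk -v pat="$1" '$0 ~ pat { n++ } END { print n + 0 }'
}

count_matches '^he'   # anchored: the BOM-prefixed first line does not match
count_matches 'he'    # unanchored: all three lines match
```

Note that /he/ is not equivalent to /^he/: it also matches lines such as "the", so dropping the anchor changes what the program selects.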

Jürgen Kahrs

2006-07-12, 6:56 pm

Andrew Schorr wrote:

> For example, running this command:
>
> printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
> LC_ALL=en_US.UTF-8 gawk '/^he/'
>
> gives this output:
>
> hello
> hello
>
> So the first hello does not match because of the BOM at the start of
> the file.
> Is this the proper behavior, or should awk ignore the leading BOM?


This should depend on the locale. In a UTF-8
locale, the BOM has a special meaning and should
be ignored. In a "C" locale, the BOM is just a
strange 3-byte binary string and should be
treated as input data.
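
The locale-dependent rule described here could be approximated outside awk roughly as follows. This is a sketch under assumptions, not something gawk does: is_utf8_locale and bom_filter are hypothetical helper names, and the locale is judged only from the names in LC_ALL, LC_CTYPE and LANG.

```shell
# is_utf8_locale: hypothetical helper. Succeeds when the effective character
# locale name (LC_ALL, then LC_CTYPE, then LANG) mentions UTF-8.
is_utf8_locale() {
    loc=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}
    case $loc in
        *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) return 0 ;;
        *) return 1 ;;
    esac
}

# bom_filter: hypothetical helper. Drops a leading BOM in a UTF-8 locale and
# passes the input through untouched in any other locale.
bom_filter() {
    if is_utf8_locale; then
        LC_ALL=C awk '
            NR == 1 && index($0, "\357\273\277") == 1 { $0 = substr($0, 4) }
            { print }
        '
    else
        cat
    fi
}

printf '\357\273\277hello\nhello\nhello\n' | bom_filter | awk '/^he/'
```

In a UTF-8 locale the filter removes the BOM, so all three lines should reach the pattern. In the "C" locale the bytes stay in the data and the first line is still skipped.
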

Jürgen Kahrs

2006-07-12, 6:56 pm

Xicheng Jia wrote:

> ^ is an anchor that matches the start of a line, so /^he/ only matches
> 'he' located at the beginning of a line. There are some other
> characters before your first hello, so it is not a match. To match all
> three cases, use /he/ instead of /^he/.


Andrew knows this. It is perfectly clear to him
what happens here (he even knows the source code
of the GAWK interpreter).

Andrew points to the fact that this behavior
(BOM being interpreted as data) is

1. not what users would expect
2. an open point in the spec of the interpreter

Xicheng Jia

2006-07-12, 6:56 pm

Jürgen Kahrs wrote:
> Xicheng Jia wrote:
>
>
> Andrew knows this. It is perfectly clear to him
> what happens here (he even knows the source code
> of the GAWK interpreter).
>
> Andrew points to the fact that this behavior
> (BOM being interpreted as data) is
>
> 1. not what users would expect
> 2. an open point in the spec of the interpreter


Sorry, I was just wondering why he would use the regex /^he/, which
obviously does not match strings that have any other non-newline
characters before *he*. :-)

Xicheng

Juergen Kahrs

2006-07-13, 3:56 am

Xicheng Jia wrote:

> Sorry, I was just wondering why he would use the regex /^he/, which
> obviously does not match strings that have any other non-newline
> characters before *he*. :-)


Yes, he did this intentionally to trigger the problem
in the interpreter. This test case was originally posted
to the xmlgawk-developers mailing list by one of our
Japanese developers (Kimura Koishi), who reported it
on behalf of a Japanese user facing the problem.

Copyright 2008 codecomments.com