handling a byte order mark (BOM) in input text
| Andrew Schorr 2006-07-12, 6:56 pm |
| Hi,
Is there a clear sense of how files with a BOM at the start should be
handled?
For example, running this command:
printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
LC_ALL=en_US.UTF-8 gawk '/^he/'
gives this output:
hello
hello
So the first hello does not match because of the BOM at the start of
the file.
Is this the proper behavior, or should awk ignore the leading BOM?
Regards,
Andy
| |
| Xicheng Jia 2006-07-12, 6:56 pm |
| Andrew Schorr wrote:
> Hi,
>
> Is there a clear sense of how files with a BOM at the start should be
> handled?
>
> For example, running this command:
>
> printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
> LC_ALL=en_US.UTF-8 gawk '/^he/'
>
> gives this output:
>
> hello
> hello
>
> So the first hello does not match because of the BOM at the start of
> the file.
> Is this the proper behavior, or should awk ignore the leading BOM?
^ is an anchor that matches the start of a line, so /^he/ only matches
'he' at the very beginning of a line. There are other characters (the
BOM bytes) before your first hello, so it does not match. To match all
three lines, use /he/ instead of /^he/.
Xicheng
| |
| Jürgen Kahrs 2006-07-12, 6:56 pm |
| Andrew Schorr wrote:
> For example, running this command:
>
> printf "\xef\xbb\xbfhello\nhello\nhello\n" |\
> LC_ALL=en_US.UTF-8 gawk '/^he/'
>
> gives this output:
>
> hello
> hello
>
> So the first hello does not match because of the BOM at the start of
> the file.
> Is this the proper behavior, or should awk ignore the leading BOM?
This should depend on the locale. In a UTF-8
locale, the BOM has a special meaning and should
be ignored. In the "C" locale, the BOM is just a
strange 3-byte binary string and should be
treated as input data.
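
Until the interpreter itself settles this, a script can drop the BOM explicitly. The following is only a minimal sketch of that workaround, not gawk's own behavior: `strip_bom` is a hypothetical helper name, and it assumes the input is UTF-8 so that the BOM is exactly the three bytes EF BB BF. It runs awk in the "C" locale so the bytes are compared literally, and it uses octal escapes because `\x` escapes are not portable across all awk and printf implementations.

```shell
# Hypothetical helper (not part of gawk): remove a leading UTF-8 BOM
# (bytes EF BB BF, octal 357 273 277) from the first record of stdin.
# LC_ALL=C makes awk treat the input as plain bytes.
strip_bom() {
    LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } { print }'
}

# The test case from this thread, with the BOM removed before matching.
printf '\357\273\277hello\nhello\nhello\n' | strip_bom | LC_ALL=C awk '/^he/'
```

With the BOM stripped first, all three hello lines should match /^he/. Input without a BOM passes through unchanged, since the sub() simply finds nothing to replace.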
| |
| Jürgen Kahrs 2006-07-12, 6:56 pm |
| Xicheng Jia wrote:
> ^ is an anchor that matches the start of a line, so /^he/ only matches
> 'he' at the very beginning of a line. There are other characters (the
> BOM bytes) before your first hello, so it does not match. To match all
> three lines, use /he/ instead of /^he/.
Andrew knows this. It is perfectly clear to him
what happens here (he even knows the source code
of the GAWK interpreter).
Andrew points to the fact that this behavior
(BOM being interpreted as data) is
1. not what users would expect
2. an open point in the spec of the interpreter
| |
| Xicheng Jia 2006-07-12, 6:56 pm |
| Jürgen Kahrs wrote:
> Xicheng Jia wrote:
>
>
> Andrew knows this. It is perfectly clear to him
> what happens here (he even knows the source code
> of the GAWK interpreter).
>
> Andrew points to the fact that this behavior
> (BOM being interpreted as data) is
>
> 1. not what users would expect
> 2. an open point in the spec of the interpreter
Sorry, I was just wondering why he would use the regex /^he/, which
obviously does not match strings that have any other non-newline
characters before *he*. :-)
Xicheng
| |
| Juergen Kahrs 2006-07-13, 3:56 am |
| Xicheng Jia wrote:
> Sorry, I was just wondering why he would use the regex /^he/, which
> obviously does not match strings that have any other non-newline
> characters before *he*. :-)
Yes, he did this intentionally to trigger the problem
in the interpreter. This test case was originally posted
to the xmlgawk-developers mailing list by one of our
Japanese developers (Kimura Koishi), who relayed it
from a Japanese user facing the problem.