Xgawk parsing bug with umlauts in comments?

Janis

2007-03-05, 6:58 pm

Is this a known bug in xgawk or in the expat parser that will disrupt
the complete parsing process if there is an umlaut in a comment at the
beginning (or maybe even anywhere in the data)?

My testcases contain lines like <!-- Für Generation 1 -->

After removing the 'ü' everything seems to be parsed quite fine.

Janis

Klaus Alexander Seistrup

2007-03-05, 6:58 pm

Janis wrote:

> Is this a known bug in xgawk or in the expat parser that will
> disrupt the complete parsing process if there is an umlaut in
> a comment at the beginning (or maybe even anywhere in the data)?


Make sure any non-ASCII characters are encoded using the same charset
that xgawk expects (probably utf-8).

> My testcases contain lines like <!-- Für Generation 1 -->
>
> After removing the 'ü' everything seems to be parsed quite fine.


I assume that the 'ü' is written in e.g. iso-8859-1. What happens
if you write the 'ü' in utf-8?

Cheers,

--
Klaus Alexander Seistrup
Tv-fri medielicensbetaler
http://klaus.seistrup.dk/
Jürgen Kahrs

2007-03-05, 6:58 pm

Janis wrote:

> Is this a known bug in xgawk or in the expat parser that will disrupt
> the complete parsing process if there is an umlaut in a comment at the
> beginning (or maybe even anywhere in the data)?
>
> My testcases contain lines like <!-- Für Generation 1 -->


If the XML data starts with this comment, then
the data is not well-formed. In this case, XMLgawk
will not continue parsing (since there is no XML data).

If the comment with the Umlaut occurs at any other
place, then the XML data is well-formed. In this
case, I cannot reproduce any problems.

If you still see a problem, please post your XML data.
Jürgen Kahrs

2007-03-05, 6:58 pm

Klaus Alexander Seistrup wrote:

> Make sure any non-ASCII characters are encoded using the same charset
> that xgawk expects (probably utf-8).


This is probably the right explanation for Janis' problems.

>
> I assume that the 'ü' is written in e.g. iso-8859-1. What happens
> if you write the 'ü' in utf-8?


This didn't cause any problems in my tests.
XMLgawk refused to parse the data only when
there was an encoding like US-ASCII in the
XML header and later an Umlaut in the data.
Janis

2007-03-06, 3:58 am

On 5 Mar., 18:57, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Janis wrote:
>
>
> If the XML data starts with this comment, then
> the data is not well-formed. In this case, XMLgawk
> will not continue parsing (since there is no XML data).
>
> If the comment with the Umlaut occurs at any other
> place, then the XML data is well-formed. In this
> case, I cannot reproduce any problems.
>
> If you still see a problem, please post your XML data.


I reduced the XML data to the essentials...

Here is the version of the XML-data that triggers _no bug_...

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Fuer Generation 1 -->
<indata>
<x id="1.2">
<y status="1">
<a>A</a>
<b>B</b>
<c>2010-07-01</c>
</y>
</x>
</indata>

...and creates correct output (blank lines in between removed)...

A
B
2010-07-01


And here is the data version that triggers the bug...

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Für Generation 1 -->
<indata>
<x id="1.2">
<y status="1">
<a>A</a>
<b>B</b>
<c>2010-07-01</c>
</y>
</x>
</indata>

The only difference is the character 'ü' in the comment.
The raw character encoding (od -x) shows 8-bit characters, only.
Both versions have been classified as valid XML by an online XML-
checker.

(Windows XP platform if it matters.)
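
A byte-level check of the kind mentioned above (8-bit characters in the raw data) can be sketched in shell roughly as follows. The helper name `count_8bit_bytes` and the sample line are illustrative assumptions, not something from the original post; the sketch only counts bytes outside 7-bit ASCII and says nothing about which encoding they belong to.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): count the bytes in a file
# that lie outside 7-bit ASCII, i.e. bytes with the high bit set.
count_8bit_bytes() {
    # Delete every byte in the range 0..127 and count what is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Example: the comment line from the failing test case in ISO-8859-1,
# where the umlaut is the single byte 0xFC (octal 374).
sample=$(mktemp)
printf '<!-- F\374r Generation 1 -->\n' > "$sample"
count_8bit_bytes "$sample"    # prints 1
rm -f "$sample"
```

The same line stored as UTF-8 would report 2, since the umlaut then occupies two bytes.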

Janis

Janis

2007-03-06, 7:57 am

On 5 Mar., 18:51, Klaus Alexander Seistrup <k...@seistrup.dk> wrote:
> Janis wrote:
>
> Make sure any non-ASCII characters are encoded using the same charset
> that xgawk expects (probably utf-8).


The observed bug appeared with an ISO 8859-1 character encoding; both
the declared character set and the encoded data were ISO Latin 1.

I had another data set encoded in UTF-8, and specified as UTF-8 within
the XML file; and the same problem could be observed.

Erasing all 8-bit characters from the XML-comments "solved" that
parsing problem. Though that can't be the solution, it's at best a
workaround.

Janis

>
>
>
> I assume that the 'ü' is written in e.g. iso-8859-1. What happens
> if you write the 'ü' in utf-8?
>
> Cheers,
>
> --
> Klaus Alexander Seistrup
> Tv-fri medielicensbetaler
> http://klaus.seistrup.dk/



Juergen Kahrs

2007-03-06, 6:57 pm

Janis wrote:

> The observed bug appeared with an ISO 8859-1 character encoding; both,
> the specified character set, and the coded data were ISO Latin 1.
>
> I had another data set encoded in UTF-8, and specified as UTF-8 within
> the XML file; and the same problem could be observed.
>
> Erasing all 8-bit characters from the XML-comments "solved" that
> parsing problem. Though that can't be the solution, it's at best a
> workaround.


It looks like Manuel Collado's email didn't reach you.
Manuel posted to our internal mailing list what he thinks
is the real problem:

> The problem is that 'locale' doesn't work in Windows.
> This leads xgawk to set the internal encoding to US-ASCII, so accented
> characters are not representable in this encoding.
>
> The solution is probably to explicitly set-up the encoding in the BEGIN
> block:
>
> BEGIN {
> ...
> XMLCHARSET = "ISO-8859-1"
> }
> ...
>
> Hope this helps.

Jürgen Kahrs

2007-03-06, 6:57 pm

Janis wrote:

> I reduced the XML data to the essentials...


Yes, good idea.

> The only difference is the character 'ü' in the comment.
> The raw character encoding (od -x) shows 8-bit characters, only.
> Both versions have been classified as valid XML by an online XML-
> checker.


I also used xmllint and xgawk to check well-formedness.
No problem on my Linux machine.

> (Windows XP platform if it matters.)


This is probably the reason for the problem.
Manuel Collado's workaround should solve the problem.
Janis

2007-03-06, 6:57 pm

On 6 Mar., 15:12, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Janis wrote:
>
>
>
> It looks like Manuel Collado's email didn't reach you.
> Manuel posted to our internal mailing list what he thinks
> is the real problem:


It indeed had not (yet) reached me (and I'll respond to him tonight
per mail). Thank you for posting his suggestion.
>
>
>
>

Yes, that fixes the problem for me. Thanks!
(And I feel better now being able to remove my 'tr -d' workaround.)

As long as in my application context the non-ASCII characters just
appear within comments (as non-operative data), I think it suffices to
hard code the XML character set as ISO-8859-1, even if UTF-8 data may
also turn up later.

Janis

Jürgen Kahrs

2007-03-06, 6:57 pm

Janis wrote:

>
> Yes, that fixes the problem for me. Thanks!
> (And I feel better now being able to remove my 'tr -d' workaround.)


That's fine.

> As long as in my application context the non-ASCII characters just
> appear within comments (as non-operative data), I think it suffices to
> hard code the XML character set as ISO-8859-1, even if UTF-8 data may
> also turn up later.


If I understood Manuel's workaround correctly,
then you should always set XMLCHARSET to the
encoding name that is used in the XML data.
This general rule should even work with Japanese
encodings.
Janis

2007-03-06, 6:57 pm

On 6 Mar., 17:42, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
>
> If I understood Manuel's workaround correctly,
> then you should always set XMLCHARSET to the
> encoding name that is used in the XML data.
> This general rule should even work with Japanese
> encodings.


Yes, I understand that; following that rule wouldn't cause any
problems worth worrying about, but...

The problem is that my data comes from several sources and I have just
the one XML processor. If I'd want to cover all possible encoding
settings in the data I'd have to parse the data to extract the string
to assign it to the builtin XML variable. (A quick try of that
approach in xml-mode didn't lead me far, though. I'll have to consult
the manual again whether it is possible with the xgawk features to
simply access the encoding data within <?...?>,[*] which would be
better than any separate preprocessing.)

Janis

[*] If that's not possible would it be a helpful feature?
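
The "separate preprocessing" alternative mentioned above could look roughly like the following shell sketch, which pulls the encoding name out of the XML declaration with plain awk before xgawk is started. The function name `xml_declared_encoding` is hypothetical, the sketch assumes the declaration sits entirely on the first line, and passing the result with `-v XMLCHARSET=...` is an assumption that was not tested in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the encoding named in
# the XML declaration on the first line of file $1, or nothing if the
# first line has no encoding pseudo-attribute.
xml_declared_encoding() {
    # \047 is the single quote; both quote styles are legal in XML.
    LC_ALL=C awk '
        NR == 1 && /<\?xml/ && match($0, /encoding[ \t]*=[ \t]*["\047][^"\047]*["\047]/) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^[^"\047]*["\047]/, "", s)   # drop everything up to the opening quote
            sub(/["\047]$/, "", s)            # drop the closing quote
            print s
        }
        { exit }                              # never read past the first line
    ' "$1"
}

# Example with the declaration used in the test data.
sample=$(mktemp)
printf '<?xml version="1.0" encoding="ISO-8859-1"?>\n<indata/>\n' > "$sample"
xml_declared_encoding "$sample"    # prints ISO-8859-1
rm -f "$sample"

# Possible use (assumes xgawk honours XMLCHARSET when set with -v):
#   enc=$(xml_declared_encoding my_xml_data.xml)
#   xgawk -v XMLCHARSET="${enc:-UTF-8}" -f my_script.awk my_xml_data.xml
```

The UTF-8 fallback in the usage comment reflects the XML default for documents without an encoding declaration (ignoring UTF-16 with a byte order mark).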

Jürgen Kahrs

2007-03-06, 6:57 pm

Janis wrote:

> The problem is that my data comes from several sources and I have just
> the one XML processor. If I'd want to cover all possible encoding
> settings in the data I'd have to parse the data to extract the string


That's right.

> to assign it to the builtin XML variable. (A quick try of that
> approach in xml-mode didn't lead me far, though. I'll have to consult
> the manual again whether it is possible with the xgawk features to
> simply access the encoding data within <?...?>,[*] which would be
> better than any separate preprocessing.)


Yes, you can read the encoding with XMLgawk.

# The very first event holds the version info.
XMLDECLARATION {
version = XMLATTR["VERSION" ]
encoding = XMLATTR["ENCODING" ]
standalone = XMLATTR["STANDALONE"]
}

The problem is that at this point in time, you cannot
_change_ the variable XMLCHARSET anymore (to be precise,
you _can_ change it, but it will have no effect on the
XML data that you are currently reading). Peter Saveliev
has already stumbled across this problem (if I remember
correctly). But we haven't changed XMLgawk to overcome
the problem.

The following workaround might be good enough for the moment.

xgawk -f my_script.awk my_xml_data.xml my_xml_data.xml

The idea is to read the file twice. The first run only
serves the purpose of properly reading the encoding name.
This name is copied to XMLCHARSET and the first run is
immediately terminated in order not to waste CPU time.
Now comes the second run through the file. Since XMLCHARSET
now has the "correct" value, you can start working with
the XML data as you intended.
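
A minimal sketch of what such a two-pass script might look like is given below, wrapped in shell so that the file is named twice as in the command above. It is an untested assumption, not code from the XMLgawk authors: it relies on `XMLMODE = 1` to switch on XML mode, on gawk's `ARGIND` and `nextfile` to tell the passes apart and to end the first one early, and on a changed XMLCHARSET taking effect for the next file opened. The final rule that prints character data is only a placeholder for the real processing.

```shell
#!/bin/sh
# Write the (hypothetical) two-pass xgawk script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { XMLMODE = 1 }          # assumed way to switch xgawk to XML mode

# Pass 1: first occurrence of the file on the command line.
# Look only at the very first event, copy the declared encoding, stop.
ARGIND == 1 {
    if (XMLDECLARATION && XMLATTR["ENCODING"] != "")
        XMLCHARSET = XMLATTR["ENCODING"]
    nextfile
}

# Pass 2: second occurrence of the same file; placeholder processing.
XMLCHARDATA { print $0 }
EOF

# Run only where xgawk is installed and a file name was given;
# the same XML file is passed twice.
if command -v xgawk >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    xgawk -f "$script" "$1" "$1"
fi
```

If the file has no XML declaration, the first pass leaves XMLCHARSET untouched and the second pass runs with whatever default xgawk chose.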

> [*] If that's not possible would it be a helpful feature?


We could change XMLgawk to solve this problem.
But I doubt that we should do so.
Maybe Peter Saveliev can comment on this.
Janis Papanagnou

2007-03-06, 6:57 pm

Jürgen Kahrs wrote:
> Janis wrote:
>
>
> That's right.
>
>
> Yes, you can read the encoding with XMLgawk.
> [snip syntax]
> The problem is that at this point in time, you cannot
> _change_ the variable XMLCHARSET anymore (to be precise,
> you _can_ change it, but it will have no effect on the
> XML data that you are currently reading). Peter Saveliev
> has already stumbled across this problem (if I remember
> correctly). But we haven't changed XMLgawk to overcome
> the problem.


Hmm.., what a pity. It would really be a neat way (and it seems
to be also a coherent way) to handle that general problem.

> The following workaround might be good enough for the moment.
> [two-pass workaround]


I think I'll stay with the hard-coded approach, at the moment, as
long as our data does not break that.

>
> We could change XMLgawk to solve this problem.
> But I doubt that we should do so.


Hmm.., I certainly don't know the disturbing details of a change,
all pros and cons, but I'm somewhat astonished about your doubt.
Can you briefly elaborate a bit?

> Maybe Peter Saveliev can comment on this.


Or I'm looking forward to Peter's comments (but is he listening
on this newsgroup?)

Janis
Jürgen Kahrs

2007-03-06, 6:57 pm

Janis Papanagnou wrote:

>
> Hmm.., what a pity. It would really be a neat way (and it seems
> to be also a coherent way) to handle that general problem.
>


Yes, but you have to remember that the problem
originates from a deficiency of MS Windows.
If MS Windows had proper support for locale,
then the problem would not exist. We cannot
streamline XMLgawk to overcome all quirks of
popular operating systems.


>
> Hmm.., I certainly don't know the disturbing details of a change,
> all pros and cons, but I'm somewhat astonished about your doubt.
> Can you briefly elaborate a bit?


The problem is that you know the encoding inside the
XML data only after starting to read the file. When
you have already started reading, you are not supposed
to change the fixed setting of a conversion method.
Allowing the user to change XMLCHARSET at any time
would make conversion much more difficult.

>
> Or I'm looking forward to Peter's comments (but is he listening
> on this newsgroup?)


I guess he is not a regular reader of comp.lang.awk.
But he has subscribed to our mailing list:

https://lists.sourceforge.net/lists...lgawk-developer
https://lists.sourceforge.net/lists...o/xmlgawk-users

If you also subscribe to one of the mailing lists,
you can address questions directly to all subscribers.