strange bug in gawk

Rolf Sander

2005-12-01, 6:59 pm

Hello,

I think I found a strange bug that occurs in some but not all gawk
versions. I even get different results when using gawk 3.1.3 on
different linux installations (SuSE vs. Fedora). Here is a small
example to test this bug:

gawk 'BEGIN {print match("X","[^a-z]")}'

The correct answer should be 1 but some gawks produce 0.

I am trying to find out the reason for this strange behaviour and I
would like to have more test results. Maybe you can help me:

What answer do you get?
Which version of gawk do you use? (type: "gawk --version")
What is your operating system?

Any ideas what is causing this problem? I suspect that it has something
to do with a library included into gawk that works on regular
expressions...

Lars Kellogg-Stedman

2005-12-01, 7:00 pm

> I think I found a strange bug that occurs in some but not all gawk
> versions.


I don't think you've found a bug in gawk. I think you're running into
the fact that different locale settings use different character
collation rules.

For example, try this:

$ LC_COLLATE=C gawk 'BEGIN {print match("X","[^a-z]")}'
1
$ LC_COLLATE=en_US gawk 'BEGIN {print match("X","[^a-z]")}'
0

I'll bet if you look at the systems on which you saw different behavior
you'll find that the locale (usually $LANG environment variable)
settings are different.
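
A minimal sketch of how one might compare the two settings side by side, and also try the POSIX class [[:lower:]] in place of the range a-z. The names AWK_BIN and check_locale are made up for this illustration, and the fallback to plain awk when gawk is missing is an assumption, not something from this thread. What the en_US line prints depends on the gawk build and on the locale data installed, so no result is claimed for it here.

```shell
#!/bin/sh
# Sketch: run the same match() test under several collation settings.
# AWK_BIN and check_locale are hypothetical names, not from the thread.
AWK_BIN=$(command -v gawk || command -v awk)

check_locale() {
    # $1 = value for LC_ALL, which takes precedence over LANG and all LC_* variables
    LC_ALL=$1 "$AWK_BIN" 'BEGIN {
        printf "range=%d class=%d\n", match("X", "[^a-z]"), match("X", "[^[:lower:]]")
    }'
}

check_locale C
check_locale en_US
```

In the C locale the range test reports 1, matching the first command above. Support for [[:lower:]] is assumed; some older non-GNU awks do not understand POSIX classes.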

For more information, you'll want to read up on locale and
internationalization -- there's a chapter in the Gawk documentation on
this, but I don't think it's terribly informative w/r/t this issue.
The following may be helpful:

# The 'Locale' section from "The Single Unix Specification":
http://lookleap.com/opengroup.org/a1

# Perl's locale documentation.
http://lookleap.com/perl.com/a1

# "Understanding locale environment variables"
http://lookleap.com/publib16.boulder.ibm.com/a1

-- Lars

--
Lars Kellogg-Stedman <8273grkci8q8kgt@jetable.net>
This email address will expire on 2005-11-23.

Rolf Sander

2005-12-01, 7:00 pm

Hello Lars,

Thanks for your reply and sorry if my question was a FAQ in this group.
I tried setting LC_COLLATE and it seems that this explains my problem
partially but not completely.
The following tests were all done on my SuSE9.1 linux pc. First with
gawk in /usr/bin and then using executables that I have copied from a
Fedora linux and from SuSE9.2.
After setting "LC_COLLATE=C" I always get the result that I want.
However, setting "LC_COLLATE=en_US" does not always yield the result 0,
which would be expected according to what you told me:

linux     gawk      LC_COLLATE
system    version   C    en_US

SuSE9.1   3.1.3     1    1
Fedora    3.1.3     1    0
SuSE9.2   3.1.4     1    1

Is it possible that the SuSE gawks ignore LC_COLLATE=en_US ?

Lars Kellogg-Stedman

2005-12-01, 7:00 pm

On 2005-12-01, Rolf Sander <sander@mpch-mainz.mpg.de> wrote:
> Is it possible that the SuSE gawks ignore LC_COLLATE=en_US ?


I don't have any experience with SuSE, so I can't give you a definitive
answer. Does your SuSE system have locale files in
'/usr/lib/locale/en_US'? Or elsewhere, if not there?
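
One way to check this without knowing the directory layout is to ask the system which locale names it offers. A rough sketch, assuming the `locale` utility is present (it is on glibc systems); have_locale is a made-up helper name. The spelling of the names varies between systems, so the list of names tried below is only a guess.

```shell
#!/bin/sh
# Sketch: report whether a locale name appears in the output of `locale -a`.
# have_locale is a hypothetical helper name.
have_locale() {
    # -x matches whole lines, -F takes the name literally, -q suppresses output
    locale -a 2>/dev/null | grep -qxF "$1"
}

for name in C en_US en_US.utf8; do
    if have_locale "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed"
    fi
done
```

If a requested locale cannot be loaded, a program's setlocale() call fails and the program stays in the default C locale, which would look the same as the variable being ignored.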

I suppose it's possible to build gawk without locale support. Hopefully
there's a SuSE person around who can give you a better answer.

-- Lars

--
Lars Kellogg-Stedman <8273grkci8q8kgt@jetable.net>
This email address will expire on 2005-11-23.

Rolf Sander

2005-12-01, 7:00 pm

Hello,

> Does your SuSE system have locale files in '/usr/lib/locale/en_US'?


Yes.

> I suppose it's possible to build gawk without locale support. Hopefully
> there's a SuSE person around who can give you a better answer.


I ran a few more tests. The SuSE9.2 gawk executable on my SuSE9.1
computer always produces "1" as the result of my test, no matter how
LC_COLLATE is set. However, when I run the SuSE9.2 gawk executable on a
SuSE9.2 system (my office roommate's computer), the result depends on
LC_COLLATE as expected. Maybe I just shouldn't use the SuSE9.2
executable on my SuSE9.1 computer...

Janis Papanagnou

2005-12-01, 7:00 pm

Rolf Sander wrote:
> Hello Lars,
>
> Thanks for your reply and sorry if my question was a FAQ in this group.
> I tried setting LC_COLLATE and it seems that this explains my problem
> partially but not completely.
> The following tests were all done on my SuSE9.1 linux pc. First with
> gawk in /usr/bin and then using executables that I have copied from a
> Fedora linux and from SuSE9.2.
> After setting "LC_COLLATE=C" I always get the result that I want.
> However, setting "LC_COLLATE=en_US" does not always yield the result 0,
> which would be expected according to what you told me:
>
> linux     gawk      LC_COLLATE
> system    version   C    en_US
>
> SuSE9.1   3.1.3     1    1
> Fedora    3.1.3     1    0
> SuSE9.2   3.1.4     1    1
>
> Is it possible that the SuSE gawks ignore LC_COLLATE=en_US ?
>


This is on a SuSE 8.2, but maybe more important...
% awk --version
GNU Awk 3.1.1

% echo $LC_COLLATE
POSIX
% gawk 'BEGIN {print match("X","[^a-z]")}'
1
% LC_COLLATE= gawk 'BEGIN {print match("X","[^a-z]")}'
0
% LC_COLLATE=C gawk 'BEGIN {print match("X","[^a-z]")}'
1
% LC_COLLATE=en_US gawk 'BEGIN {print match("X","[^a-z]")}'
0

There are several locale and language environment variables that _might_
influence the output. With reference to your posting downthread, list all
the environment variables on your system and on your roommate's and compare
the LANG, LANGUAGE, and all the LC_* variables; it may give you some hint.
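
A small sketch of that comparison: print only the locale-related variables, sorted, so the output from two machines can be saved and diffed. show_locale_env is a made-up name for this illustration.

```shell
#!/bin/sh
# Sketch: print only the locale-related environment variables, sorted,
# so that the output from two machines can be compared with diff.
# show_locale_env is a hypothetical helper name.
show_locale_env() {
    env | grep -E '^(LANG|LANGUAGE|LC_[A-Z_]+)=' | sort
}

show_locale_env
```

Running `locale` with no arguments additionally shows the effective value of every category, including those that are not set explicitly. The order of precedence is LC_ALL first, then the individual LC_* variable, then LANG.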

Janis

Rolf Sander

2005-12-07, 3:56 am

Thanks for all your help! There are still some mysteries to me, but I
understand the basic message:

It's not a bug, it's a feature!

In my scripts, I will now precede every call to gawk by setting
LC_COLLATE=C. I think this way they will become more portable.
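
A sketch of what that could look like as a wrapper function, so the setting does not have to be repeated on every call. It uses LC_ALL rather than LC_COLLATE, because a value of LC_ALL in the caller's environment takes precedence over LC_COLLATE and would silently undo the fix. AWK_BIN and awk_c are made-up names, and the fallback to plain awk is an assumption.

```shell
#!/bin/sh
# Sketch: force the C locale for awk calls only, leaving the rest of the
# script in the user's locale.
# AWK_BIN and awk_c are hypothetical names.
AWK_BIN=$(command -v gawk || command -v awk)

awk_c() {
    # The assignment applies to this one command only.
    LC_ALL=C "$AWK_BIN" "$@"
}

awk_c 'BEGIN { print match("X", "[^a-z]") }'   # prints 1
```

Note that LC_ALL=C also switches the character type, message language and number formatting to the C locale for those calls, which may or may not be wanted.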

Kenny McCormack

2005-12-07, 7:56 am

In article <1133948368.955740.83150@g49g2000cwa.googlegroups.com>,
Rolf Sander <sander@mpch-mainz.mpg.de> wrote:
>Thanks for all your help! There are still some mysteries to me, but I
>understand the basic message:
>
>It's not a bug, it's a feature!
>
>In my scripts, I will now precede every call to gawk by setting
>LC_COLLATE=C. I think this way they will become more portable.


There's a lot to be said for putting: LC_ALL=C
in /etc/profile, and then never having to worry about this nonsense again.
