Author GAWK oddity...
Kenny McCormack

2005-12-17, 9:55 pm

Consider:

#!/usr/bin/gawk -f
BEGIN {
$0 = " 12\n 34\n 97\n"
RS = "";FS=/[ \n]+/;OFS="|" # This line!
$1=$1
print
}

The output is (the expected):

12|34|97

But, this is an odd way of doing things. It is more natural to set the
RS/FS/OFS variables at the top of the script, before data is assigned.

However, if we move the "This line!" up one line - that is, before the
assignment to $0, the output is:

12| 34| 97|

Which, I submit, is wrong.

Don Stokes

2005-12-17, 9:55 pm

In article <do2e4m$4k6$1@yin.interaccess.com>,
Kenny McCormack <gazelle@interaccess.com> wrote:
>Consider:
>
>#!/usr/bin/gawk -f
>BEGIN {
> $0 = " 12\n 34\n 97\n"
> RS = "";FS=/[ \n]+/;OFS="|" # This line!
> $1=$1
> print
> }
>
>The output is (the expected):
>
>12|34|97
>
>But, this is an odd way of doing things. It is more natural to set the
>RS/FS/OFS variables at the top of the script, before data is assigned.
>
>However, if we move the "This line!" up one line - that is, before the
>assignment to $0, the output is:
>
> 12| 34| 97|
>
>Which, I submit, is wrong.


What is wrong is:

FS=/[ \n]+/

which does not do what you think it does. That is, it searches $0 for
the regex "[ \n]+", and sets FS to 1 or 0 depending on whether the regex
was found or not.
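
A minimal sketch of that behaviour, assuming a POSIX-style awk is installed as `awk` (the variable names are made up for illustration). A bare /regex/ on the right-hand side of an assignment is evaluated as a match against $0, so the variable receives 0 or 1 rather than the pattern:

```sh
# Sketch: a bare /regex/ in an expression means ($0 ~ /regex/).
regex_assign=$(awk 'BEGIN {
    $0 = "no digits here"
    x = /[0-9]+/        # same as: x = ($0 ~ /[0-9]+/)
    print x             # 0: $0 does not match
    $0 = "abc 123"
    x = /[0-9]+/
    print x             # 1: $0 matches
}')
printf '%s\n' "$regex_assign"
```

The same thing happens when the left-hand side is FS, which is why FS ends up as the string "0" or "1".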

In this particular example, the first "working" example sets FS=1,
*after* splitting the line using the default FS. The only reason your
example works at all is because "[ \n]" is pretty close to what the
default behaviour is when splitting $0. ("\n" is considered whitespace
along with the spaces.)

The RS is also superfluous unless you're reading from a file -- if you
assign to $0, the record is by definition the string you just assigned
and RS is not considered. Setting RS="" means that input records are
terminated by one or more blank lines; which may be what you finally
want, but isn't relevant to this example.

So, what you really wanted is:

#!/usr/bin/gawk -f
BEGIN {
FS = "[ \n]+" # Define the field separation
$0 = " 12\n 34\n 97\n" # Assign the record
OFS = "|" # Set OFS for output
$1 = $1 # Re-format $0 using the new OFS
print # Output the line
}

.... noting that unless the $0 string contains whitespace that you want
to keep other than spaces and line feeds, the FS = ... line is actually
unnecessary as the default behaviour does what you want.
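
As a rough illustration of that last point (a sketch, assuming `awk` is any POSIX-style awk; the shell variable name is hypothetical), the default FS already treats newlines as whitespace and drops the leading and trailing separators:

```sh
# Sketch: no FS assignment at all, only OFS for the output side.
default_fs=$(awk 'BEGIN {
    $0 = " 12\n 34\n 97\n"   # split with the default FS (a single space)
    OFS = "|"
    $1 = $1                  # force $0 to be rebuilt using OFS
    print                    # 12|34|97
}')
printf '%s\n' "$default_fs"
```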

-- don
Don Stokes

2005-12-17, 9:55 pm

Apologies for the self-followup and for being excessively pedantic ...

Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
>Kenny McCormack <gazelle@interaccess.com> wrote:

Note too that "12|34|97" is not actually the expected output of the
corrected script. The expected output when splitting with "[ \n]+" is:

|12|34|97|

as $0 starts and ends with whitespace, which would be matched by FS.
The default behaviour if FS is set to " " (a special case, not a regex)
is to ignore leading and trailing whitespace (including LFs, spaces,
tabs etc), and split fields on intervening strings of whitespace. For
example:

FS=" " ; $0 = " a b " -> NF=2, $1="a", $2="b"
FS=" +" ; $0 = " a b " -> NF=4, $1="", $2="a", $3="b", $4=""

The first line is the default behaviour; the second is a regex. The regex
matches the leading and trailing whitespace, whereas the default special
case ignores leading and trailing whitespace.
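
The two cases can be checked with a short script. This is only a sketch, assuming a POSIX-style `awk`; the helper show() and the labels are invented for the example:

```sh
# Sketch: compare the special-case FS " " with the regex FS " +".
fs_compare=$(awk '
function show(label,    i) {
    printf "%s NF=%d", label, NF
    for (i = 1; i <= NF; i++) printf " [%s]", $i
    print ""
}
BEGIN {
    FS = " ";  $0 = " a b "; show("default")   # default NF=2 [a] [b]
    FS = " +"; $0 = " a b "; show("regex")     # regex NF=4 [] [a] [b] []
}')
printf '%s\n' "$fs_compare"
```

The empty brackets in the second line are the empty fields produced by the leading and trailing separators.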

-- don
Kenny McCormack

2005-12-18, 7:55 am

In article <43a4d4e4$1@clear.net.nz>,
Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
....
>What is wrong is:
>
> FS=/[ \n]+/
>
>which does not do what you think it does. That is, it searches $0 for
>the regex "[ \n]+", and sets FS to 1 or 0 depending on whether the regex
>was found or not.


Ah yes. Comments:
1) My reason for posting wasn't to get help with any particular
problem, but rather to illustrate a point I've made several
times through the years, which is that messing with the
built-in variables is dangerous and should be avoided if
possible. There are lots of built-in funny gotchas and
cross-effects.
2) This is what I get for using gawk. I'm used to using TAWK,
where regexps are first class data types, and this stuff doesn't
happen. Don't get me wrong; gawk is "best in class" - that is,
it's the best you can do among the freely available AWKs, but
it does get a few things wrong (vis-à-vis TAWK).
....
>The RS is also superfluous unless you're reading from a file -- if you
>assign to $0, the record is by definition the string you just assigned
>and RS is not considered. Setting RS="" means that input records are
>terminated by one or more blank lines; which may be what you finally
>want, but isn't relevant to this example.


Obviously, this was a cooked example to illustrate the issue. In real
life, RS is relevant, and, of course, I left it in the example in case
there were any cross effects.

Anyway, thanks for jogging my memory about these two points.

cumin

2005-12-18, 6:56 pm


Kenny McCormack wrote:
> In article <43a4d4e4$1@clear.net.nz>,
> Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
> ...
>
> Ah yes. Comments:
> 1) My reason for posting wasn't to get help with any particular
> problem, but rather to illustrate a point I've made several
> times through the years, which is that messing with the
> built-in variables is dangerous and should be avoided if
> possible. There are lots of built-in funny gotchas and
> cross-effects.
> 2) This is what I get for using gawk. I'm used to using TAWK,
> where regexps are first class data types, and this stuff doesn't
> happen. Don't get me wrong; gawk is "best in class" - that is,
> it's the best you can do among the freely available AWKs, but
> it does get a few things wrong (vis-à-vis TAWK).
> ...
>
> Obviously, this was a cooked example to illustrate the issue. In real
> life, RS is relevant, and, of course, I left it in the example in case
> there were any cross effects.
>
> Anyway, thanks for jogging my memory about these two points.


A quick follow-up question:

In this context, what does it mean (and what is the significance) that
in TAWK regexp's are first class data types? I take it to mean that you
can create a named variable of that type, which can be passed around,
but I am guessing.

Thanks.

Don Stokes

2005-12-18, 6:56 pm

Kenny McCormack <gazelle@interaccess.com> wrote:
> 2) This is what I get for using gawk. I'm used to using TAWK,
> where regexps are first class data types, and this stuff doesn't
> happen. Don't get me wrong; gawk is "best in class" - that is,
> it's the best you can do among the freely available AWKs, but
> it does get a few things wrong (vis-à-vis TAWK).


Right. There's no such thing in awk (as opposed to TAWK) as a regex
datatype. Rather, [<string> [!]~] /<regex>/ is a boolean expression,
and an assignment to that will get the result of that expression, i.e. 0
or 1.

The /<regex>/ syntax tells the parser three things: (a) it signals that a
regex comparison is to be done (if the parser hasn't already worked that
out from a ~ operator); (b) it allows the parser to recognise that it
is a regex that can be pre-compiled; and, less importantly, (c) it allows
subtly different syntax with regard to quoting, e.g. /\x/ is equivalent
to "\\x".
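
Point (c) can be seen with an escaped dot. This is a sketch only, assuming a POSIX-style `awk`; the variable names are illustrative. Inside /.../ one backslash is enough, while inside a string the backslash has to be doubled to survive string processing:

```sh
# Sketch: the same regex written as a literal and as a string.
quoting=$(awk 'BEGIN {
    s = "a.b"; t = "axb"
    lit_s = (s ~ /a\.b/);  lit_t = (t ~ /a\.b/)     # regex literal
    str_s = (s ~ "a\\.b"); str_t = (t ~ "a\\.b")    # string used as a regex
    print lit_s, lit_t      # 1 0
    print str_s, str_t      # 1 0
}')
printf '%s\n' "$quoting"
```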

(b) means that less work needs to be done by not parsing the regex every
time. I haven't looked at how the gawk code actually does this (or even
if indeed it does), but quick tests indicate that comparisons using
<string> ~ /<regex>/ are about 10% faster than <string> ~ "<regex>", at
least for short strings and regexes.

What is a little confusing is that some built-in functions, e.g.
split(), sub(), match() & friends, treat the /<regex>/ syntax as a kind
of regex literal.

And to be particularly confusing, split(s, a, /./) behaves differently
to split(s, a, "."). In the former, /./ is (not terribly usefully)
treated as a regex matching every character, while in the latter the
single-character string "." is looked for literally. A two or more
character quoted string (or string variable) is treated as a regex.
Kinda useful, but, well, surprising ...
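
A small sketch of the split() cases, assuming a POSIX-style `awk` (the array and variable names are made up). A one-character string separator is taken literally, a regex literal is a regex, and a longer string is also used as a regex:

```sh
# Sketch: how split() interprets its third argument.
split_demo=$(awk 'BEGIN {
    s = "a.b.c"
    print split(s, parts, ".")        # 3: "." is a literal separator
    print split(s, parts, /./)        # regex: every character is a separator (6 in gawk)
    print split("a--b", parts, "-+")  # 2: a longer string is used as a regex
}')
printf '%s\n' "$split_demo"
```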

-- don
Kenny McCormack

2006-01-10, 3:58 am

In article <1134922624.962232.105680@g49g2000cwa.googlegroups.com>,
cumin <jkilbourne@gmail.com> wrote:
....
>A quick follow-up question:
>
>In this context, what does it mean (and what is the significance) that
>in TAWK regexp's are first class data types? I take it to mean that you
>can create a named variable of that type, which can be passed around,
>but I am guessing.


Basically, yes. It means that you can do:

x = /foo/

and have it do the intuitive (albeit, non-POSIX) thing - which is to
assign the regexp to the variable x.

It also means you can pass regexps to user defined functions, like this:

function f(x) { print "typeof(x) =",typeof(x) }
BEGIN { f(/foo/) }

will print out "regular_expression". (Yes, typeof() is also super-standard)

As Don mentions in another post, POSIX AWKs have an ambiguity in that
sometimes when you pass a regexp to a (built-in) function, it is passed as
a reg exp (e.g., match(), sub(), gsub(), etc), but most of the time, a bare
regexp has an integer type (i.e., a value of 0 or 1). This never happens
in TAWK.

However, TAWK has an ambiguity in that a bare regexp still has to mean what
it usually means if it appears outside of curly braces (i.e., in the
"pattern" space). So:

/foo/ {print "Foo!"} # still works as expected.

Here are a couple of edge cases in TAWK - both of which do behave sensibly,
although they are potentially ambiguous:

r = /foo/ { print "r =",r } # 4

and

function f(x) { print "typeof(x) =",typeof(x);return x } # regular_expression
r = f(/foo/) { print "r =",r } # foo

Harlan Grove

2006-01-10, 3:58 am

Don Stokes wrote...
>Kenny McCormack <gazelle@interaccess.com> wrote:
>
>Right. There's no such thing in awk (as opposed to TAWK) as a regex
>datatype. Rather, [<string> [!]~] /<regex>/ is a boolean expression,
>and an assignment to that will get the result of that expression, i.e. 0
>or 1.

....

How would tawk handle

/x/ { print foo $0 bar }

?

The standard way (per the grammar in TAPL) would be as equivalent to

$0 ~ /x/ { print foo $0 bar }
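
For standard awk the equivalence is easy to check. This is a sketch, assuming a POSIX-style `awk`; the sample input is invented:

```sh
# Sketch: a bare /x/ pattern and the explicit $0 ~ /x/ select the same records.
pattern_forms=$(printf 'xylophone\nbanana\n' | awk '
/x/      { print "bare:", $0 }       # implicit match against $0
$0 ~ /x/ { print "explicit:", $0 }   # the same test written out
')
printf '%s\n' "$pattern_forms"
```

Only the first input line contains an x, so each rule fires once, on that line.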

If tawk does so, then /x/ alone would be a boolean-valued expression.
But when used as all of the right hand side of an assignment, it
becomes a different data type, a precompiled regular expression object.
Begs the question whether different semantics for the same token in
different contexts is a good thing. Also begs the question whether it
wouldn't have been more in keeping with awk tradition to have used a
function for this, e.g.,

re = __TAWK_COMPILE_RE(/x/)

which would treat the regex literal as a function argument differently
than as the value of the implicit expression $0 ~ /x/, just like split,
match and [g]sub. Or maybe even an additional, nonstandard pattern like
__TAWK_COMPILE_TIME, in which expressions that could be created at
compile time and should remain constant throughout runtime could be
initialized. I'm guessing tawk would either choke on

{ if ($1 < $2) r = /abc/; else r = /xyz/; print $3 ~ r }

or would store all regex literals and use pointers to each when
assigning to variables.
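
For comparison, standard awk has no regex values to assign, so the usual way to get the same effect is to keep the pattern in a string and let ~ treat it as a dynamic regex. A sketch, assuming a POSIX-style `awk` and made-up input (this says nothing about what tawk does internally):

```sh
# Sketch: choose a pattern at run time by storing it as a string.
dynamic_re=$(printf '1 2 abcdef\n3 2 abcdef\n' | awk '{
    r = ($1 < $2) ? "abc" : "xyz"   # pattern held as a string, not a regex object
    m = ($3 ~ r)                    # the string is used as a regex here
    print m
}')
printf '%s\n' "$dynamic_re"
```

This prints 1 for the first record (1 < 2, so "abc" is tried against "abcdef") and 0 for the second.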

But tawk is what it is (and it seems will remain so for eternity).

Copyright 2008 codecomments.com