GAWK oddity...
| Kenny McCormack 2005-12-17, 9:55 pm |
| Consider:
#!/usr/bin/gawk -f
BEGIN {
$0 = " 12\n 34\n 97\n"
RS = "";FS=/[ \n]+/;OFS="|" # This line!
$1=$1
print
}
The output is (the expected):
12|34|97
But, this is an odd way of doing things. It is more natural to set the
RS/FS/OFS variables at the top of the script, before data is assigned.
However, if we move the "This line!" up one line - that is, before the
assignment to $0, the output is:
12| 34| 97|
Which, I submit, is wrong.
| |
| Don Stokes 2005-12-17, 9:55 pm |
| In article <do2e4m$4k6$1@yin.interaccess.com>,
Kenny McCormack <gazelle@interaccess.com> wrote:
>Consider:
>
>#!/usr/bin/gawk -f
>BEGIN {
> $0 = " 12\n 34\n 97\n"
> RS = "";FS=/[ \n]+/;OFS="|" # This line!
> $1=$1
> print
> }
>
>The output is (the expected):
>
>12|34|97
>
>But, this is an odd way of doing things. It is more natural to set the
>RS/FS/OFS variables at the top of the script, before data is assigned.
>
>However, if we move the "This line!" up one line - that is, before the
>assignment to $0, the output is:
>
> 12| 34| 97|
>
>Which, I submit, is wrong.
What is wrong is:
FS=/[ \n]+/
which does not do what you think it does. That is, it searches $0 for
the regex "[ \n]+", and sets FS to 1 or 0 depending on whether the regex
was found or not.
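A quick way to check this is to assign a bare regex to an ordinary variable and print what was stored. This is only a minimal sketch: it assumes a POSIX-style awk is available on the PATH as `awk`, and the shell variable names are made up for illustration.

```shell
# A bare /regex/ in expression context is shorthand for ($0 ~ /regex/),
# so the assignment stores 1 or 0, not the pattern itself.
with_match=$(awk 'BEGIN { $0 = " 12\n 34\n"; x = /[ \n]+/; print x }')
without_match=$(awk 'BEGIN { $0 = ""; x = /[ \n]+/; print x }')
echo "record contains a separator: x = $with_match"
echo "record is empty:             x = $without_match"
```

The same thing happens when the variable on the left is FS: it ends up holding the match result rather than the regex.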
In this particular example, the first "working" example sets FS=1,
*after* splitting the line using the default FS. The only reason your
example works at all is because "[ \n]" is pretty close to what the
default behaviour is when splitting $0. ("\n" is considered whitespace
along with the spaces.)
The RS is also superfluous unless you're reading from a file -- if you
assign to $0, the record is by definition the string you just assigned
and RS is not considered. Setting RS="" means that input records are
terminated by one or more blank lines; which may be what you finally
want, but isn't relevant to this example.
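As a rough illustration of what RS="" does when input really is being read, the sketch below counts records for the same four lines of input with and without paragraph mode. The sample input and the shell variable names are invented for the example; it assumes a POSIX-style awk on the PATH as `awk`.

```shell
# Four input lines: "a", "b", an empty line, "c".
# Default RS: every line (including the empty one) is a record.
line_records=$(printf 'a\nb\n\nc\n' | awk 'END { print NR }')
# RS = "": records are paragraphs separated by blank lines.
para_records=$(printf 'a\nb\n\nc\n' | awk 'BEGIN { RS = "" } END { print NR }')
echo "default RS: $line_records records"
echo "RS = \"\":    $para_records records"
```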
So, what you really wanted is:
#!/usr/bin/gawk -f
BEGIN {
FS = "[ \n]+" # Define the field separation
$0 = " 12\n 34\n 97\n" # Assign the record
OFS = "|" # Set OFS for output
$1 = $1 # Re-format $0 using the new OFS
print # Output the line
}
.... noting that unless the $0 string contains whitespace that you want
to keep other than spaces and line feeds, the FS = ... line is actually
unnecessary as the default behaviour does what you want.
-- don
| |
| Don Stokes 2005-12-17, 9:55 pm |
| Apologies for the self-followup and for being excessively pedantic ...
Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
>Kenny McCormack <gazelle@interaccess.com> wrote:
Note too that this is not actually the expected output. The expected
output when splitting with "[ \n]+" is:
|12|34|97|
as $0 starts and ends with whitespace, which would be matched by FS.
The default behaviour if FS is set to " " (a special case, not a regex)
is to ignore leading and trailing whitespace (including LFs, spaces,
tabs etc), and split fields on intervening strings of whitespace. For
example:
FS=" " ; $0 = " a b " -> NF=2, $1="a", $2="b"
FS=" +" ; $0 = " a b " -> NF=4, $1="", $2="a", $3="b", $4=""
The first line is the default behaviour; the second a regex. The regex
matches the leading and trailing whitespace, whereas the default special
case ignores leading and trailing spaces.
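The two cases can be tried directly. This is a small sketch, assuming a POSIX-style awk on the PATH as `awk`; the shell variable names are only for illustration.

```shell
# FS = " " is the special default: leading/trailing whitespace is ignored.
default_nf=$(awk 'BEGIN { FS = " ";  $0 = " a b "; print NF }')
# FS = " +" is an ordinary regex: the leading and trailing runs of spaces
# each delimit an empty field.
regex_nf=$(awk 'BEGIN { FS = " +"; $0 = " a b "; print NF }')
echo "FS=\" \"  -> NF=$default_nf"
echo "FS=\" +\" -> NF=$regex_nf"
```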
-- don
| |
| Kenny McCormack 2005-12-18, 7:55 am |
| In article <43a4d4e4$1@clear.net.nz>,
Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
....
>What is wrong is:
>
> FS=/[ \n]+/
>
>which does not do what you think it does. That is, it searches $0 for
>the regex "[ \n]+", and sets FS to 1 or 0 depending on whether the regex
>was found or not.
Ah yes. Comments:
1) My reason for posting wasn't to get help with any particular
problem, but rather to illustrate a point I've made several
times through the years, which is that messing with the
built-in variables is dangerous and should be avoided if
possible. There are lots of built-in funny gotchas and
cross-effects.
2) This is what I get for using gawk. I'm used to using TAWK,
where regexps are first class data types, and this stuff doesn't
happen. Don't get me wrong; gawk is "best in class" - that is,
it's the best you can do among the freely available AWKs, but
it does get a few things wrong (vis-à-vis TAWK).
....
>The RS is also superfluous unless you're reading from a file -- if you
>assign to $0, the record is by definition the string you just assigned
>and RS is not considered. Setting RS="" means that input records are
>terminated by one or more blank lines; which may be what you finally
>want, but isn't relevant to this example.
Obviously, this was a cooked example to illustrate the issue. In real
life, RS is relevant, and, of course, I left it in the example in case
there were any cross-effects.
Anyway, thanks for jogging my memory about these two points.
| |
| cumin |
Kenny McCormack wrote:
> In article <43a4d4e4$1@clear.net.nz>,
> Don Stokes <don@daedalus.co.not-this-bit.nz> wrote:
> ...
>
> Ah yes. Comments:
> 1) My reason for posting wasn't to get help with any particular
> problem, but rather to illustrate a point I've made several
> times through the years, which is that messing with the
> built-in variables is dangerous and should be avoided if
> possible. There are lots of built-in funny gotchas and
> cross-effects.
> 2) This is what I get for using gawk. I'm used to using TAWK,
> where regexps are first class data types, and this stuff doesn't
> happen. Don't get me wrong; gawk is "best in class" - that is,
> it's the best you can do among the freely available AWKs, but
> it does get a few things wrong (vis-à-vis TAWK).
> ...
>
> Obviously, this was a cooked example to illustrate the issue. In real
> life, RS is relevant, and, of course, I left it in the example in case
> there were any cross-effects.
>
> Anyway, thanks for jogging my memory about these two points.
A quick follow-up question:
In this context, what does it mean (and what is the significance) that
in TAWK regexp's are first class data types? I take it to mean that you
can create a named variable of that type, which can be passed around,
but I am guessing.
Thanks.
| |
| Don Stokes 2005-12-18, 6:56 pm |
| Kenny McCormack <gazelle@interaccess.com> wrote:
> 2) This is what I get for using gawk. I'm used to using TAWK,
> where regexps are first class data types, and this stuff doesn't
> happen. Don't get me wrong; gawk is "best in class" - that is,
> it's the best you can do among the freely available AWKs, but
> it does get a few things wrong (vis-à-vis TAWK).
Right. There's no such thing in awk (as opposed to TAWK) as a regex
datatype. Rather, [<string> [!]~] /<regex>/ is a boolean expression,
and an assignment to that will get the result of that expression, i.e. 0
or 1.
The /<regex>/ syntax tells the parser three things: (a) it signals that a
regex comparison is to be done (if the parser hasn't already worked that
out from a ~ operator); (b) it allows the parser to recognise that it
is a regex that can be pre-compiled; and, less importantly, (c) it allows
subtly different syntax with regard to quoting, e.g. /\x/ is equivalent
to "\\x".
(b) means that less work needs to be done by not parsing the regex every
time. I haven't looked at how the gawk code actually does this (or even
if indeed it does), but quick tests indicate that comparisons using
<string> ~ /<regex>/ are about 10% faster than <string> ~ "<regex>", at
least for short strings and regexes.
What is a little confusing is that some built-in functions, e.g.
split(), sub(), match() & friends, treat the /<regex>/ syntax as a kind of
regex literal.
And to be particularly confusing, split(s, a, ".") behaves differently
to split(s, a, /./). In the former, the single-character string "."
is looked for, while in the latter, /./ is (not terribly
usefully) treated as a regex matching every character. A two or more
character quoted string (or string variable) is treated as a regex.
Kinda useful, but, well, surprising ...
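The difference between a one-character string separator and the regex /./ shows up in the number of pieces split() returns. A minimal sketch, assuming a POSIX-style awk on the PATH as `awk` (the test string and variable names are invented for illustration):

```shell
# "." as a one-character string: split on literal dots -> a, b, c.
literal_n=$(awk 'BEGIN { print split("a.b.c", parts, ".") }')
# /./ as a regex: every character counts as a separator.
regex_n=$(awk 'BEGIN { print split("a.b.c", parts, /./) }')
echo "split on \".\" gives $literal_n pieces"
echo "split on /./ gives $regex_n pieces"
```

To split on a literal dot with a regex, the dot has to be escaped or bracketed, e.g. /[.]/.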
-- don
| |
| Kenny McCormack 2006-01-10, 3:58 am |
| In article <1134922624.962232.105680@g49g2000cwa.googlegroups.com>,
cumin <jkilbourne@gmail.com> wrote:
....
>A quick follow-up question:
>
>In this context, what does it mean (and what is the significance) that
>in TAWK regexp's are first class data types? I take it to mean that you
>can create a named variable of that type, which can be passed around,
>but I am guessing.
Basically, yes. It means that you can do:
x = /foo/
and have it do the intuitive (albeit, non-POSIX) thing - which is to
assign the regexp to the variable x.
It also means you can pass regexps to user defined functions, like this:
function f(x) { print "typeof(x) =",typeof(x) }
BEGIN { f(/foo/) }
will print out "regular_expression". (Yes, typeof() is also super-standard)
As Don mentions in another post, POSIX AWKs have an ambiguity in that
sometimes when you pass a regexp to a (built-in) function, it is passed as
a reg exp (e.g., match(), sub(), gsub(), etc), but most of the time, a bare
regexp has an integer type (i.e., a value of 0 or 1). This never happens
in TAWK.
However, TAWK has an ambiguity in that a bare regexp still has to mean what
it usually means if it appears outside of curly braces (i.e., in the
"pattern" space). So:
/foo/ {print "Foo!"} # still works as expected.
Here are a couple of edge cases in TAWK - both of which do behave sensibly,
although they are potentially ambiguous:
r = /foo/ { print "r =",r } # 4
and
function f(x) { print "typeof(x) =",typeof(x);return x } # regular_expression
r = f(/foo/) { print "r =",r } # foo
| |
| Harlan Grove 2006-01-10, 3:58 am |
| Don Stokes wrote...
>Kenny McCormack <gazelle@interaccess.com> wrote:
>
>Right. There's no such thing in awk (as opposed to TAWK) as a regex
>datatype. Rather, [<string> [!]~] /<regex>/ is a boolean expression,
>and an assignment to that will get the result of that expression, i.e. 0
>or 1.
....
How would tawk handle
/x/ { print foo $0 bar }
?
The standard way (per the grammar in TAPL) would be as equivalent to
$0 ~ /x/ { print foo $0 bar }
If tawk does so, then /x/ alone would be a boolean-valued expression.
But when used as all of the right hand side of an assignment, it
becomes a different data type, a precompiled regular expression object.
Begs the question whether different semantics for the same token in
different contexts is a good thing. Also begs the question whether it
wouldn't have been more in keeping with awk tradition to have used a
function for this, e.g.,
re = __TAWK_COMPILE_RE(/x/)
which would treat the regex literal as a function argument differently
than as the value of the implicit expression $0 ~ /x/, just like split,
match and [g]sub. Or maybe even an additional, nonstandard pattern like
__TAWK_COMPILE_TIME, in which expressions that could be created at
compile time and should remain constant throughout runtime could be
initialized. I'm guessing tawk would either choke on
{ if ($1 < $2) r = /abc/; else r = /xyz/; print $3 ~ r }
or would store all regex literals and use pointers to each when
assigning to variables.
But tawk is what it is (and it seems will remain so for eternity).
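For comparison, the usual way to pick a pattern at run time in POSIX awk is to keep it in a string and use that string as a dynamic regex on the right-hand side of ~. The sketch below is only an illustration of that approach, not of anything tawk does; the sample input is invented, and it assumes a POSIX-style awk on the PATH as `awk`.

```shell
# r holds a string, which ~ interprets as a regex at match time.
result=$(printf '1 2 abcdef\n2 1 abcdef\n' |
  awk '{ if ($1 < $2) r = "abc"; else r = "xyz"; print ($3 ~ r) }')
echo "$result"
```

The first input line selects "abc", which matches $3; the second selects "xyz", which does not.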