| Ed Morton 2006-10-17, 6:55 pm |
| getline misuse comes up frequently in this NG and I don't know of any
one place that succinctly documents the dos and don'ts of getline, so
I've tried to capture it below with a view to getting it added to some
FAQ somewhere some time (or at least having a NG posting that people can
refer to). Comments?
Ed.
-----------------------------------------------------------------
The book Effective Awk Programming, Third Edition By Arnold Robbins
(http://www.oreilly.com/catalog/awkprog3) provides much of the
source for this discussion of getline.
Variants
--------
The following summarises the eight variants of getline applications,
listing which variables are set by each one:
Variant                  Variables Set
getline                  $0, ${1-NF}, NF, FNR, NR, FILENAME
getline var              var, FNR, NR, FILENAME
getline < file           $0, ${1-NF}, NF
getline var < file       var
command | getline        $0, ${1-NF}, NF
command | getline var    var
command |& getline       $0, ${1-NF}, NF
command |& getline var   var
The "command |& ..." variants are GNU awk (gawk) extensions. gawk also
populates the ERRNO builtin variable if getline fails.
Although calling getline is very rarely the right approach (see
below), if you need to do it the safest ways to invoke getline are:
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)
since those do not affect any of the builtin variables and they allow
you to correctly test for getline succeeding or failing. If you
need the input record split into separate fields, just call "split()"
to do that.
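A minimal sketch of the "getline var < file" form with split(), assuming a small colon-separated file. The file name, its contents, and the variable names (line, f, safe_out) are made up for illustration:

```shell
tmp=$(mktemp -d)
printf 'alice:30\nbob:25\n' > "$tmp/people.txt"    # hypothetical sample data

safe_out=$(awk -v file="$tmp/people.txt" '
BEGIN {
    # "> 0" is true only for a successful read; 0 is EOF, -1 is an error
    while ((getline line < file) > 0) {
        n = split(line, f, ":")        # fields go into f[], not $1..$NF
        print f[1], "is", f[2], "(" n " fields, NF=" NF ")"
    }
    close(file)
}')
printf '%s\n' "$safe_out"
rm -rf "$tmp"
```

Because the record lands in "line" and the fields in f[], $0 and NF keep whatever values they had before the loop.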
Gotchas
-------
Normally FILENAME is not set within a BEGIN section, but a
non-redirected call to getline will set it.
Calling "getline < FILENAME" is NOT the same as calling "getline".
The second form will read the next record from FILENAME while
the first form will read the first record again.
Calling getline without a var to be set will update $0, the fields, and NF,
so they will have a different value for subsequent processing than they
had for prior processing in the same condition/action block.
Many of the getline variants above set some but not all of the
builtin variables, so you need to be very careful that it's
setting the ones you need/expect it to.
According to POSIX, `getline < expression' is ambiguous if
expression contains unparenthesized operators other than `$'; for
example, `getline < dir "/" file' is ambiguous because the
concatenation operator is not parenthesized. You should write it
as `getline < (dir "/" file)' if you want your program to be
portable to other awk implementations.
In POSIX-compliant awks (e.g. gawk --posix) a failure of getline
(e.g. trying to read from a non-readable file) will be fatal to
the program, otherwise it won't.
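To make the $0/NF gotcha concrete, here is a small sketch, assuming two input records fed on stdin; the variable names (before, gotcha_out) are illustrative only:

```shell
gotcha_out=$(printf 'a b c\nd e\n' | awk '
{
    before = $0 " (NF=" NF ")"
    if ((getline) > 0)                 # no var: overwrites $0, $1..$NF and NF
        print "before: " before "; after: " $0 " (NF=" NF ", NR=" NR ")"
}')
printf '%s\n' "$gotcha_out"
```

Any code later in the same action that still expects the first record will silently be working on the second one.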
Applications
------------
getline is an appropriate solution for the following:
a) Reading from a pipe:
    command = "ls"
    while ( (command | getline var) > 0) {
        print var
    }
    close(command)
b) Reading from a coprocess, e.g.:
    command = "LC_ALL=C sort"
    n = split("abcdefghijklmnopqrstuvwxyz", a, "")
    for (i = n; i > 0; i--)
        print a[i] |& command
    close(command, "to")
    while ((command |& getline var) > 0)
        print "got", var
    close(command)
c) In the BEGIN section, reading some initial data that's referenced
during processing multiple subsequent input files, e.g.:
    BEGIN {
        while ( (getline var < ARGV[1]) > 0) {
            data[var]++
        }
        close(ARGV[1])
        ARGV[1]=""
    }
    $0 in data
In all other cases, it's clearest, simplest, less error-prone, and
easiest to maintain to let awk's normal text-processing read the records.
In the case of "c", whether to use the BEGIN+getline approach or just
collect the data within the awk condition/action part after
testing for the first file is largely a style choice.
"a" above calls the UNIX command "ls" to list the current directory
contents, then prints the result one line at a time.
"b" above writes the letters of the alphabet in reverse order, one per
line, down the two-way pipe to the UNIX "sort" command. It then closes
the write end of the pipe, so that sort receives an end-of-file
indication. This causes sort to sort the data and write the sorted
data back to the gawk program. Once all of the data has been read,
gawk terminates the coprocess and exits. This is particularly necessary
in order to use the UNIX "sort" utility as part of a coprocess since
sort must read all of its input data before it can produce any output.
The sort program does not receive an end-of-file indication until gawk
closes the write end of the pipe. Other programs can be invoked as just:
    command = "program"
    do {
        print data |& command
        if ((command |& getline var) <= 0)
            break                       # EOF or error from the coprocess
    } while (data left to process)      # pseudocode condition
    close(command)
"c" above reads every record of the first file passed as an argument to
awk into an array and then for every subsequent file passed as an
argument will print every record from that file that matches any of
the records that appeared in the first file (and so are stored in the
"data" array). This could alternatively have been implemented as:
    # fails if first file is empty
    NR==FNR { data[$0]++; next }
    $0 in data
or:
    FILENAME==ARGV[1] { data[$0]++; next }
    $0 in data
or:
    FILENAME=="<specific name>" { data[$0]++; next }
    $0 in data
or (gawk only):
    ARGIND==1 { data[$0]++; next }
    $0 in data
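As a rough check that approach "c" and the no-getline alternative behave the same, here is a sketch that runs both against two small files. The file names and contents are assumptions made up for the example:

```shell
tmp=$(mktemp -d)
printf 'apple\ncherry\n' > "$tmp/keys.txt"             # hypothetical lookup file
printf 'apple\nbanana\ncherry\n' > "$tmp/input.txt"    # hypothetical input file

# Approach "c": load the first argument in BEGIN, then blank it out of ARGV
with_getline=$(awk '
BEGIN {
    while ((getline var < ARGV[1]) > 0)
        data[var]++
    close(ARGV[1])
    ARGV[1] = ""
}
$0 in data' "$tmp/keys.txt" "$tmp/input.txt")

# Alternative: let the main loop read both files
without_getline=$(awk '
FILENAME == ARGV[1] { data[$0]++; next }
$0 in data' "$tmp/keys.txt" "$tmp/input.txt")

printf '%s\n' "$with_getline"
rm -rf "$tmp"
```

Both runs print the records of the second file that also appear in the first, which is why the choice between them is mostly a matter of style.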
| |
| Ed Morton 2006-10-17, 6:55 pm |
| Ed Morton wrote:
<snip>
> FILENAME=="" { data[$0]++; next }
FILENAME=="<specific name>" { data[$0]++; next }
| |
| Steve Calfee 2006-10-17, 6:55 pm |
| On Tue, 17 Oct 2006 11:59:25 -0500, Ed Morton <morton@lsupcaemnt.com>
wrote:
>getline misuse comes up frequently in this NG and I don't know of any
>one place that succinctly documents the dos and don'ts of getline, so
>I've tried to capture it below with a view to getting it added to some
>FAQ somewhere some time (or at least having a NG posting that people can
>refer to). Comments?
>
> Ed.
>
Hi Ed,
Really nice summary. 90% of the problem is realizing that getline is a
problem.
I really don't have much to add. I do note that sometimes someone
doing a little awk app for someone else might not want that someone
else to mistype the filenames. I think in this case, getline is not
the solution, but having the awk program manipulate the ARGV array is
better. Doing something foolproof is hard, because the fools keep
getting more clever.
Regards, Steve
There is no "x" in my email address.
| |
| Kenny McCormack 2006-10-17, 6:55 pm |
| In article <iboaj2ljt46pv0c463qe3bkicvjil306f1@4ax.com>,
Steve Calfee <stevecalfee@hotmail.com> wrote:
>On Tue, 17 Oct 2006 11:59:25 -0500, Ed Morton <morton@lsupcaemnt.com>
>wrote:
>
>Hi Ed,
>
>Really nice summary. 90% of the problem is realizing that getline is a
>problem.
>
>I really don't have much to add. I do note that sometimes someone
>doing a little awk app for someone else might not want that someone
>else to mistype the filenames. I think in this case, getline is not
>the solution, but having the awk program manipulate the ARGV array is
>better. Doing something foolproof is hard, because the fools keep
>getting more clever.
My feeling about this situation as well as several related ones is that
you always have to provide a wrapper (batch file in DOS, shell script in
Unix, etc) so that the end user can just "run it" (nowadays, "just
double click it").
| |
| Anton Treuenfels 2006-10-18, 3:56 am |
|
"Ed Morton" <morton@lsupcaemnt.com> wrote:
*snip*
> getline is an appropriate solution for the following:
*snip*
> In all other cases, it's clearest, simplest, less error-prone, and
I dunno - AWK (more specifically TAWK) doesn't let me do the equivalent of
"including" a file. That is, there's no built-in way to say "stop processing
the current file, begin processing another file, and when you're done with
it, return to the first file at the point you left it".
However I can write functions to do that using getline (with other
functions, of course), even nesting "include" to an arbitrary depth. When I
do this it is usually an integral capability of the program I'm writing. In
these cases all control is in the BEGIN block, and even the very first file
is read strictly using getline.
A perversion of AWK's design? Perhaps, but what other language makes it so
easy to manipulate those twisty little text lines once I acquire them, by
whatever means I can manage?
- Anton Treuenfels
PS. Your comments about which built-in variables get set make me wonder if
FILENAME, which I had always assumed was unset by my method, is in fact set.
I shall have to check this!
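One possible shape for such an "include" mechanism, as a sketch only: the "#include <path>" directive syntax, the function name process(), and the file names are all assumptions for illustration, not TAWK features, and paths are assumed to contain no spaces.

```shell
tmp=$(mktemp -d)
printf 'top 1\n#include %s\ntop 2\n' "$tmp/inner.txt" > "$tmp/outer.txt"
printf 'inner 1\ninner 2\n' > "$tmp/inner.txt"

include_out=$(awk -v start="$tmp/outer.txt" '
# "line" and "parts" are extra parameters used as locals,
# so each nesting level has its own copy
function process(file,    line, parts) {
    while ((getline line < file) > 0) {
        if (split(line, parts, " ") == 2 && parts[1] == "#include")
            process(parts[2])          # recurse; the outer file stays open
        else
            print line
    }
    close(file)
}
BEGIN { process(start) }')
printf '%s\n' "$include_out"
rm -rf "$tmp"
```

Since awk identifies an open file by its name, a file that includes itself (directly or through another file) would share one input stream across nesting levels, so this sketch only suits include trees without cycles.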
| |
| Kenny McCormack 2006-10-18, 3:56 am |
| In article <hniZg.10943$Y24.6376@newsread4.news.pas.earthlink.net>,
Anton Treuenfels <atreuenfels@earthlink.net> wrote:
>
>"Ed Morton" <morton@lsupcaemnt.com> wrote in message > ------------
>
>*snip*
>
>
>*snip*
>
>
>I dunno - AWK (more specifically TAWK) doesn't let me do the equivalent of
>"including" a file. That is, there's no built-in way to say "stop processing
>the current file, begin processing another file, and when you're done with
>it, return to the first file at the point you left it".
Yes. When you absolutely, positively have to process 2 files in
parallel (e.g., if you are doing a line-by-line compare between two
files [*]), then you do have to use getline. This is one of the rare
exceptions to Ed's rule(s).
[*] Something I once did in AWK as a way of comparing two very large,
sorted files, at a time when "diff" had a 2500 line limit.
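A sketch of that kind of line-by-line compare, assuming the main loop reads one file while getline reads the other. The file names, the "<missing>" marker, and the output format are invented for the example:

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/left.txt"      # hypothetical inputs
printf 'one\n2\nthree\n'   > "$tmp/right.txt"

compare_out=$(awk -v other="$tmp/right.txt" '
{
    # main loop supplies the left record; getline supplies the matching right one
    if ((getline theirs < other) <= 0)
        theirs = "<missing>"
    # force string comparison so that 1.0 and 1 count as different
    if (($0 "") != (theirs ""))
        print FNR ": " $0 " | " theirs
}
END { close(other) }' "$tmp/left.txt")
printf '%s\n' "$compare_out"
rm -rf "$tmp"
```

This version only reports differences up to the end of the left file; extra trailing lines in the right file would need a further getline loop in END.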
| |
|
| Ed Morton wrote:
>
> getline misuse comes up frequently in this NG and I don't know of any
> one place that succinctly documents the dos and don'ts of getline, so
> I've tried to capture it below with a view to getting it added to some
> FAQ somewhere some time (or at least having a NG posting that people can
> refer to). Comments?
It's a good idea to make that valuable compilation publicly available.
If it will just be the posted version (as opposed to a FAQ) then I'd
suggest to add more keywords in the posting title ("getline caveats
misuse howto ..." ???) to find and identify your article easier.
> getline $0, ${1-NF}, NF, FNR, NR, FILENAME
> getline < file $0, ${1-NF}, NF
> command | getline $0, ${1-NF}, NF
> Although calling getline is very rarely the right approach (see
> below), if you need to do it the safest ways to invoke getline are:
> [...]
> since those do not affect any of the builtin variables and they allow
> you to correctly test for getline succeeding or failing.
Personally the above "unsafe" variants have never been a problem to
me; changing those builtin variables is the behaviour that I expect.
What I *do* find irritating (and which you might want to add to your
compilation) is the behaviour of...
| $ gawk 'BEGIN {getline v1 v2; print v1; print v2; print NF}'
| some vars all in v1
|
| 0
Gawk (not sure about other awks) does not complain about the syntax!
(v1 v2 are lvalues so there's no need to assume concatenation syntax.)
Especially if you also do shell programming and think/expect getline
to behave like a 'read v1 v2 v_rest' it might be at least confusing.
(BTW, at a first glance I mistook "${1-NF}" for a shell default var
expansion; might $1-$NF be clearer?)
> Calling "getline < FILENAME" is NOT the same as calling "getline".
> The second form will read the next record from FILENAME while
> the first form will read the first record again.
Uh-oh! Side-effect alarm. I'd never mix getline with FILENAME.
> b) Reading from a coprocess, e.g.:
>
> close(command, "to")
Gawk specific close() argument.
> [...] It then closes
> the write end of the pipe, so that sort receives an end-of-file
> indication.
> [...] This could alternatively have been implemented as:
>
> # fails if first file is empty
> NR==FNR{ data[$0]++; next }
> 0 in data
Typo: $0 in data
Janis
| |
| Kenny McCormack 2006-10-18, 7:55 am |
| In article <1161165933.528178.108170@h48g2000cwc.googlegroups.com>,
Janis <janis_papanagnou@hotmail.com> wrote:
....
>What I *do* find irritating (and which you might want to add to your
>compilation) is the behaviour of...
>
>| $ gawk 'BEGIN {getline v1 v2; print v1; print v2; print NF}'
>| some vars all in v1
>|
>| 0
>
>Gawk (not sure about other awks) does not complain about the syntax!
>(v1 v2 are lvalues so there's no need to assume concatenation syntax.)
You do, of course, realize how that gets parsed and why it is perfectly
legal AWK syntax. Hint: It is one of the many perils of having an
implicit, rather than explicit string concatenation operator. One of
the (few) weaknesses of AWK.
>Especially if you also do shell programming and think/expect getline
>to behave like a 'read v1 v2 v_rest' it might be at least confusing.
There's really no point in doing (non-trivial) shell programming, once
you've mastered AWK. I've used the shell's "read" command about twice
in 20 years.
| |
| Thomas Weidenfeller 2006-10-18, 7:55 am |
| Janis wrote:
> (BTW, at a first glance I mistook "${1-NF}" for a shell default var
> expansion; might $1-$NF be clearer?)
<nitpick>
Ages ago my engineering professors insisted on never using '-' to
indicate ranges, because it is too easy to mistake it for arithmetic or a
negative sign. '...' is the way to indicate ranges.
</nitpick>
/Thomas
| |
| Thomas Weidenfeller 2006-10-18, 7:55 am |
| Ed Morton wrote:
> Although calling getline is very rarely the right approach (see
> below), if you need to do it the safest ways to invoke getline are:
>
> if/while ( (getline var < file) > 0)
> if/while ( (command | getline var) > 0)
> if/while ( (command |& getline var) > 0)
>
> since those do not affect any of the builtin variables and they allow
> you to correctly test for getline succeeding or failing. If you
> need the input record split into separate fields, just call "split()"
> to do that.
A minor addition:
I haven't checked whether it is sanctioned by POSIX, but if you need to
distinguish between a normal EOF and a read or open error, you end
up with something like:
    if/while ( (e = (getline var < file)) > 0) { ... }
    close(file)
    if (e < 0) some_error_handling
which is not pretty at all.
Forgetting the close() is also a gotcha I have seen more than once.
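A runnable sketch of that pattern, assuming a path that does not exist so that the error branch is taken. The path, the messages, and the variable names are made up for illustration:

```shell
status_out=$(awk -v file="/nonexistent/hypothetical/path.txt" '
BEGIN {
    while ((e = (getline var < file)) > 0)
        count++
    close(file)
    if (e < 0)
        print "error reading file"           # -1: could not open or read
    else
        print "reached EOF after " (count + 0) " records"
}')
printf '%s\n' "$status_out"
```

Keeping the getline result in "e" is what lets the code tell the 0 (EOF) and -1 (error) cases apart after the loop ends.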
/Thomas
| |
|
| Kenny McCormack wrote:
>
> In article <1161165933.528178.108170@h48g2000cwc.googlegroups.com>,
> Janis <janis_papanagnou@hotmail.com> wrote:
> ...
>
> You do, of course, realize how that gets parsed and why it is perfectly
> legal AWK syntax. Hint: It is one of the many perils of having an
> implicit, rather than explicit string concatenation operator. One of
> the (few) weaknesses of AWK.
I think there's no confusion about the concatenation, because the
concatenation requires "r-values", not "l-values" as in this specific
context. So the parser *could* detect that as inappropriate syntax.
>
> There's really no point in doing (non-trivial) shell programming, once
> you've mastered AWK. I've used the shell's "read" command about twice
> in 20 years.
Well, using the shell's 'read' is rather trivial, IMO, though it indeed
has its own caveats (off-topic here); but I use it regularly whenever
I need it.
Janis
| |
|
| Thomas Weidenfeller wrote:
>
> Janis wrote:
>
> <nitpick>
> Ages ago my engineering professors insisted on never using '-' to
> indicate ranges, because it is too easy to mistake it for arithmetic or a
> negative sign. '...' is the way to indicate ranges.
> </nitpick>
Agreed. And personally I also use the ellipsis for that purpose.
But that's not my publication, but Ed's; and I wouldn't want to
give advice on style preferences. Especially since I've observed
that in some (many?) parts of the world it still seems to be the
preferred and common way to write ranges.
Janis
| |
| Kenny McCormack 2006-10-18, 7:55 am |
| In article <1161173190.817303.113430@m73g2000cwd.googlegroups.com>,
Janis <janis_papanagnou@hotmail.com> wrote:
>Kenny McCormack wrote:
>
>I think there's no confusion about the concatenation, because the
>concatenation requires "r-values", not "l-values" as in this specific
>context. So the parser *could* detect that as inappropriate syntax.
Hmm. Maybe I assumed too much.
The point is that *any* getline command is, in fact, an arithmetic
expression - i.e., an r-value. It is *not* inappropriate syntax.
| |
|
| Kenny McCormack wrote:
>
> The point is that *any* getline command is, in fact, an arithmetic
> expression - i.e., an r-value. It is *not* inappropriate syntax.
You are right. It's explainable. I missed that in the given
usage context. Thanks for pointing it out.
Janis
| |
| Thomas Weidenfeller 2006-10-18, 7:55 am |
| Janis wrote:
> But that's not my publication, but Ed's; and I wouldn't want to
> give advice on style preferences.
Come on, he asked for feedback, he got it :-)
/Thomas
| |
| Manuel Collado 2006-10-18, 7:55 am |
| Janis wrote:
> Kenny McCormack wrote:
>
>
> I think there's no confusion about the concatenation, because the
> concatenation requires "r-values", not "l-values" as in this specific
> context. So the parser *could* detect that as inappropriate syntax.
It is parsed as
    { (getline v1) v2; ...
So there is an implicit concatenation. It is also legal to write
    { getline 3; ...
that is parsed as
    { (getline) 3; ...
also assuming a concatenation.
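One way to see that parse in action, as a sketch: give v2 a visible value and capture the result of the whole expression. The variable r and the value "X" are assumptions added for the demonstration, and the thread only confirms this behaviour for gawk:

```shell
parse_out=$(printf 'some vars all in v1\n' | awk '
BEGIN {
    v2 = "X"
    r = (getline v1 v2)     # parsed as (getline v1) v2, i.e. "1" concatenated with "X"
    print "r=" r
    print "v1=" v1
    print "v2=" v2
}')
printf '%s\n' "$parse_out"
```

The whole input line ends up in v1, v2 is only read (never assigned), and the return value of getline is glued onto it.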
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
| |
| Manuel Collado 2006-10-18, 7:55 am |
| Janis wrote:
> Thomas Weidenfeller wrote:
>
>
> Agreed. And personally I also use the ellipsis for that purpose.
> But that's not my publication, but Ed's; and I wouldn't want to
> give advice on style preferences. Especially since I've observed
> that in some (many?) parts of the world it still seems to be the
> preferred and common way to write ranges.
As in REs: [a-zA-Z]
So this is the AWK way to write ranges :-|
| |
| Ed Morton 2006-10-19, 7:55 am |
| Ed Morton wrote:
> getline misuse comes up frequently in this NG and I don't know of any
> one place that succinctly documents the dos and don'ts of getline, so
> I've tried to capture it below with a view to getting it added to some
> FAQ somewhere some time (or at least having a NG posting that people can
> refer to). Comments?
Thanks for all the comments, I'll post an updated version in the next
day or two.
Ed.