Author working with multiple input files
durant

2006-10-16, 6:55 pm

i am processing 2 files, one is processed in the BEGIN section of the
gawk script and works great. i use a close statement to close this
file.

once in the main section i am trying to process the second file, but it
seems like the main section is not processing anything. i am unable to
get a simple print statement to echo a message that the program has
reached that point.

i am running gawk in DOS by the way, not sure if that is causing the
headaches.

thanks
durant

Ed Morton

2006-10-16, 6:55 pm

Show us the smallest script you can write that produces this problem.

Ed.

durant

2006-10-16, 6:55 pm

here is what i am trying to do and cannot get "hello" to print in the
main section. plus i included sample input files if needed.

#!s:\rdg\unixutil\usr\local\wbin\gawk -f
BEGIN {
    while (getline < "first_file.csv" > 0){
        print $1 >"first_file.out";
        n++;
    }
    close("first_file.csv");
    print n;
}
{
    print "hello";
    #
    while ( getline <"second_file.csv" > 0){
        print $1,$2;
    }
}

first_file.csv
UWI
42251301080000
42251301080000
42251301080000
42251301080000

second_file.csv
API,OPER
42121300750,2.99
42121300820,3
42121300880,4
42121300950,28.91
42121300970,92.00
42121301040,6



Vassilis

2006-10-16, 6:55 pm

Please don't top post.


I think you're trying to execute your script as: gawk -f script.awk or
script.awk
That is, you don't specify any filenames, so awk hangs, waiting to
read from stdin.
I think you've misunderstood how awk processes its input.
Here is how you could write the script to process these files:

currfile != FILENAME {
    if (currfile != "") {
        if (nfile == 1)
            print NR - 1    # records in the first file; FNR has already restarted at 1

        close(currfile)
    }

    currfile = FILENAME
    nfile++
}

nfile == 1 { print $1 > "first_file.out" }
nfile == 2 { print $1, $2 }

Call it with
script.awk first_file.csv second_file.csv

durant

2006-10-16, 6:55 pm

thanks for your comments, but it is my understanding that you can
dictate the filenames to read from, from within the awk routine. that
way i do not have to specify on the command line. if the filenames do
not change, seems to be more efficient to hardcode the names.

but i will specify on command line and make it work.




Ed Morton

2006-10-16, 6:55 pm


I think you're right: the problem is almost certainly in how he's invoking
the script, and he has definitely misunderstood how awk works. I'd have
written it as just:

NR == FNR { print $1 > "first_file.out"; n++; next }
FNR == 1 { print n ORS "hello" }
{ print $1,$2 }

and invoked it as you suggest. Since there are only 2 files, there doesn't
seem much point in closing the first one. You could use "ARGIND == 1"
(gawk only) or other ways of distinguishing the first from subsequent
files if you're worried about the first file being empty.
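
[Editor's sketch, not part of the original post.] The three-line script above can be tried from a POSIX shell as follows. The scratch directory, the name script.awk, and the -F, option are assumptions made for this illustration; the thread never sets a field separator, but the sample data is comma-separated, so -F, is used here to make $1 and $2 meaningful.

```shell
# Work in a throwaway directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data taken from the thread (second file shortened).
printf '%s\n' UWI 42251301080000 42251301080000 42251301080000 42251301080000 > first_file.csv
printf '%s\n' API,OPER 42121300750,2.99 42121300820,3 > second_file.csv

cat > script.awk <<'EOF'
# First file only: NR (records so far) equals FNR (records in this file).
NR == FNR { print $1 > "first_file.out"; n++; next }
# First record of the second file: report the count, then the marker.
FNR == 1  { print n ORS "hello" }
# Every record of the second file.
{ print $1, $2 }
EOF

# The file names are ordinary command-line operands; awk opens them itself.
out=$(awk -F, -f script.awk first_file.csv second_file.csv)
printf '%s\n' "$out"
```

With this sample data the count printed is 5, because the header line UWI is counted as a record too.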

For the OP - "getline" has a ton of caveats and must not be used unless
you're totally familiar with all of them and have one of those VERY rare
situations where getline is the right solution.

Ed.

Ed Morton

2006-10-16, 6:55 pm

durant wrote:

> thanks for your comments,


As Vassilis said, "Please don't top post.".

>
> Vassilis wrote:
>
>


> but it is my understanding that you can
> dictate the filenames to read from, from within the awk routine. that
> way i do not have to specify on the command line. if the filenames do
> not change, seems to be more efficient to hardcode the names.


If you want to do that, you could do it this way:

BEGIN{ARGV[ARGC++]="first_file.csv";ARGV[ARGC++]="second_file.csv"}
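
[Editor's sketch, not part of the original post.] A small way to see the ARGV technique working, assuming a POSIX shell and an awk that honours changes to ARGV made in BEGIN. The scratch directory, the shortened sample data, the -F, option, and the FILENAME prefix in the output are assumptions added for illustration.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' UWI 42251301080000 > first_file.csv
printf '%s\n' API,OPER 42121300750,2.99 > second_file.csv

# No file operands on the command line: BEGIN appends them to ARGV,
# and the main rule then reads both files through awk's normal input loop.
# stdin is redirected from /dev/null so the command cannot wait on the keyboard.
out=$(awk -F, '
BEGIN { ARGV[ARGC++] = "first_file.csv"; ARGV[ARGC++] = "second_file.csv" }
{ print FILENAME ": " $1 }
' < /dev/null)
printf '%s\n' "$out"
```

The file names stay hardcoded in the script, as the OP wanted, while the main section still receives every record without any getline loop.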

> but i will specify on command line and make it work.


OK, just so long as you don't use getline.

Ed.

Rufus V. Smith

2006-10-17, 6:55 pm

You have excess brackets.

Your BEGIN clause is executing, of course.

But you are trying to read the second file outside of the begin
clause and in a "match all input lines" processing clause.

As someone else pointed out, you are probably executing your script
with no arguments, so it is waiting on STDIN.

If you are calling this from a batch file, the script will exit.

If you invoke it from the command line, it will wait for you to
enter something at the keyboard. If you enter ^z it will end; if
you enter some other line it will process the file.

You should process the second file within your begin clause as well,
for the behavior you want.

Rufus

Ed Morton

2006-10-17, 6:55 pm

Rufus V. Smith wrote:

Please don't top-post. Fixed below.

>
> "durant" <durant_greenwood@sbcglobal.net> wrote in message
> news:1161032769.041423.158270@h48g2000cwc.googlegroups.com...
>
>
> You have excess brackets.


No he doesn't. Maybe some of the greater-than or less-than symbols look
like brackets in your viewer. He does, however, have his bracketing
incorrect for using getline so he's exposing himself to one of getline's
gotchas. He's also exposing himself to two other getline gotchas but no
need to go into that as long as he just follows the advice given and
avoids getline for this exercise.
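
[Editor's sketch, not part of the original post.] The post does not spell out the bracketing problem or the other two gotchas. One defensive form often recommended is to parenthesize the getline call, read into a variable rather than $0, and test the return value with "> 0" so that an unreadable file ends the loop. The variable names line, n and m and the file no_such_file.csv are assumptions made for this illustration.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' UWI 42251301080000 42251301080000 > first_file.csv

out=$(awk '
BEGIN {
    file = "first_file.csv"
    # Explicit grouping: read one line into "line", then compare the
    # return value (1 = record read, 0 = end of file, -1 = error).
    while ((getline line < file) > 0)
        n++
    close(file)
    print n

    # A file that cannot be opened returns -1, so the loop body never runs.
    while ((getline line < "no_such_file.csv") > 0)
        m++
    print m + 0
}')
printf '%s\n' "$out"
```

Because the program has only a BEGIN rule, awk does not go on to read standard input after it finishes.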

<snip>
> You should process the second file within your begin clause as well,
> for the behavior you want.


No, he shouldn't process either file in the BEGIN clause.

Ed.

mjc

2006-10-17, 6:55 pm



I don't know why getline is looked down on - I often use it.

Anyway, to do this the way you want, just put the second getline loop
in the BEGIN along with the first and then put an exit at the end to
prevent awk waiting for input:

BEGIN {
    while (getline < "first_file.csv" > 0){
        print $1 >"first_file.out";
        n++;
    }
    close("first_file.csv");
    print n;

    #
    while ( getline <"second_file.csv" > 0){
        print $1,$2;
    }
    exit;
}

Ed Morton

2006-10-17, 6:55 pm

mjc wrote:

<snip>
> I don't know why getline is looked down on - I often use it.


getline is fine when used correctly, but it's best avoided by default
because:

a) It allows people to stick to their preconceived ideas of how to
program rather than learning the easier way that awk was designed to
read input. It's like C programmers continuing to do procedural
programming in C++ rather than learning the new paradigm and the
supporting language constructs.

b) It has many insidious caveats that come back to bite you either
immediately or in future. I've tried to capture some of those and
explain when getline IS appropriate in a separate thread I just started
in this NG.

As the good book (Effective Awk Programming, Third Edition By Arnold
Robbins; http://www.oreilly.com/catalog/awkprog3) says:

"The getline command is used in several different ways and should not be
used by beginners. ... come back and study the getline command after you
have reviewed the rest ... and have a good knowledge of how awk works."

Regards,

Ed.

Rufus V. Smith

2006-10-20, 6:55 pm


"Ed Morton" <morton@lsupcaemnt.com> wrote in message
news:_fOdnabTmoEgbanYnZ2dnUVZ_tWdnZ2d@comcast.com...
> Rufus V. Smith wrote:
>
> Please don't top-post. Fixed below.
>
>
> No he doesn't. Maybe some of the greater-than or less-than symbols look
> like brackets in your viewer. He does, however, have his bracketing
> incorrect for using getline so he's exposing himself to one of getline's
> gotchas. He's also exposing himself to two other getline gotchas but no
> need to go into that as long as he just follows the advice given and
> avoids getline for this exercise.
>
> <snip>
>
> No, he shouldn't process either file in the BEGIN clause.
>
> Ed.
>


Sorry about the prior top-post. Outlook Express sets me up for the fall
every time, and as we use Outlook at work, sometimes I forget.

I understand (and agree with) your advice about doing it the correct way,
and I don't know enough about awk to know the getline gotchas.

But are you saying it simply will not work the way I (and mjc) suggested?

Rufus


Grant

2006-10-20, 6:55 pm

On Fri, 20 Oct 2006 13:13:37 -0400, "Rufus V. Smith" <nospam@nospam.com> wrote:

>I understand (and agree) with your advice about doing it the correct way,
>and I don't know enough about awk to know the getline gotchas.


If you want to compare working script examples using both models to
set up the same in-memory database:

http://bugsplatter.mine.nu/junkview/junkview

Search for: function read_database_files(s) and the function it calls,
which reads database files with getline in the END section, as the
program already read log file/s in the awk file reader section.

Compare it with:

http://bugsplatter.mine.nu/junkview/ip2c-server

which reads the same files to memory using the awk file reader before
entering the END section.

Grant.
--
http://bugsplatter.mine.nu/

Ed Morton

2006-10-20, 6:55 pm

Rufus V. Smith wrote:

> "Ed Morton" <morton@lsupcaemnt.com> wrote in message
> news:_fOdnabTmoEgbanYnZ2dnUVZ_tWdnZ2d@comcast.com...
>
<snip>
<snip>
> But are you saying it simply will not work the way I (and mjc) suggested?
>
> Rufus
>


What you suggested should work if the awk you invoke parses the getline
command line the way you want (the syntax used is ambiguous) and you
don't write any code in future that relies on any awk variables (NF, $0,
FILENAME, etc.) being unset in the BEGIN section and you don't trip over
any other getline gotchas, none of which spring to mind right now for
this case but that doesn't mean they don't exist.
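
[Editor's sketch, not part of the original post.] The point about NF and $0 no longer being unset in BEGIN can be observed directly. This assumes a POSIX shell; the scratch directory, the -F, option and the "before"/"after" labels are additions for illustration only.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' API,OPER 42121300750,2.99 > second_file.csv

out=$(awk -F, '
BEGIN {
    file = "second_file.csv"
    print "before: NF=" NF
    # A plain "getline < file" (no variable) overwrites $0 and NF,
    # even though no main-input record has been read yet.
    if ((getline < file) > 0)
        print "after: NF=" NF ", $0=" $0
    close(file)
}')
printf '%s\n' "$out"
```

Reading into a variable instead, as in "getline line < file", leaves $0 and NF untouched.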

It's mainly just a very obtuse way to do it as you're being forced to
manually re-write the record reading work-loop that's a fundamental part
of awk already.

Ed.