Running AWK serially

Lurker

2006-01-10, 3:58 am

I have created an AWK script which is run over a series of files passed
on the command line, but have been totally flummoxed with the strange
behaviour I get when counting unique values. This is the error
behaviour shown in full with the command line and console output...

awk -f scripts/find_slice_values.awk fieldname="genre" `find catalog -type f -name machine.txt` | sort | uniq
[11]ic_design
graphic_design [8]
[2]stration
illustration [2]

On reflection, Awk appears to be spawning a parallel instance when the
number of files on the command line exceeds a certain number. Whatever
is causing it, counts of specific string patterns, and print behaviour
seem to be screwed up, since it seems like more than one count array is
created, and print commands overwrite each other.

I remember encountering some description of this behaviour
(parallelizing when more than X files are passed in) but I can't
remember the exact terms to describe it, or work out how to turn it
off.

The script processes multiple machine-readable metadata files, and
constructs indexes of unique values found across multiple files to
assist in navigation of the directory contents.

An example machine.txt file looks like this...

machine.txt
format:brochure
clients: Goltsblat LTD
media: digital
genre: graphic design, illustration

with a filesystem layout of machine readable text files like this...

find catalog -name machine.txt
catalog/arch_brochure/machine.txt
catalog/bc_catalogue/machine.txt
catalog/bc_marching/machine.txt
catalog/cat_poster/machine.txt
catalog/doubledecker/machine.txt
catalog/hmco/machine.txt
catalog/mos_annual/machine.txt
catalog/mos_cd/machine.txt
catalog/mos_holiday2003/machine.txt
catalog/mos_mediaday2001/machine.txt
catalog/mos_mediaday2004/machine.txt
catalog/rg_booklets/machine.txt
catalog/risd_poster/machine.txt
catalog/risd_poster2/machine.txt
catalog/risd_poster3/machine.txt
catalog/sartcouncil_poster/machine.txt
catalog/tuscan_cover/machine.txt
catalog/wisdom_spirit/machine.txt
catalog/ymaa_kendo/machine.txt

The script to process multiple files is like this...

find_slice_values.awk
# Limits matches to those under the parameter
# fieldname
# if this parameter is passed in from the command line

BEGIN {
    FS = ":";
}

$1 ~ fieldname {
    # splits values at commas and dumps spaces
    split($2, values, " *, *");

    # trims whitespace
    # replaces spaces with underscores
    # keeps count of unique values
    for (idx in values) {
        valuename = values[idx];
        gsub(/^[ ]+|[ ]+$/, "", valuename);
        gsub(/ /, "_", valuename);
        valuecount[valuename]++;
    }
}

END {
    # outputs unique values and counts
    for (valuename in valuecount) {
        print valuename, "[" valuecount[valuename] "]";
    }
}
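
The script depends on awk treating the fieldname="genre" operand as a variable assignment, not a file name, and on `$1 ~ fieldname` being a regular-expression match. A minimal sketch of that mechanism follows. The temporary file and the one-line awk program are illustrative stand-ins, not the original script:

```shell
# Invented two-line sample in the same colon-separated layout.
tmp=$(mktemp)
printf 'format:brochure\ngenre: graphic design, illustration\n' > "$tmp"

# The assignment operand is applied just before the file that follows it
# is read, so fieldname is set for the main rules but not inside BEGIN.
matched=$(awk -F: '$1 ~ fieldname { print $2 }' fieldname="genre" "$tmp")
printf '%s\n' "$matched"

rm -f "$tmp"
```

If fieldname is left unset, the pattern compares against an empty regular expression, which matches every line. That is presumably why the script's header comment says the limit applies only when the parameter is passed in.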

Ed Morton

2006-01-10, 3:58 am

Lurker wrote:
> I have created an AWK script which is run over a series of files passed
> on the command line, but have been totally flummoxed with the strange
> behaviour I get when counting unique values. This is the error
> behaviour shown in full with the command line and console output...
>
> awk -f scripts/find_slice_values.awk fieldname="genre" `find catalog
> -type f -name machine.txt` | sort | uniq


<OT>
FYI
sort | uniq
is the same as
sort -u
</OT>
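
A quick way to check the equivalence, sketched with made-up sample values (the input below is illustrative, not taken from the catalog):

```shell
# Sample input is invented for illustration.
input='illustration
graphic_design
illustration
graphic_design'

with_uniq=$(printf '%s\n' "$input" | sort | uniq)
with_u=$(printf '%s\n' "$input" | sort -u)

# Both pipelines leave one copy of each distinct line.
printf '%s\n' "$with_uniq"
printf '%s\n' "$with_u"
```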

> [11]ic_design
> graphic_design [8]
> [2]stration
> illustration [2]
>
> On reflection, Awk appears to be spawning a parallel instance when the
> number of files on the command line exceeds a certain number.


I think it's much more likely that you have control characters in your
input files. Find a small subset (e.g. 1 or 2) that reproduce the
problem, and run "cat -v" on them to see what control chars they contain.
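
A stray carriage return would explain the console output. The counting key becomes the value plus an invisible CR, so files with and without the CR are counted separately. When such a key is printed, the CR moves the cursor back to the start of the line and the count is written over the value. The check can be sketched as below, assuming a hypothetical one-line sample file in place of a real machine.txt:

```shell
# Hypothetical sample: one metadata line with a DOS (CRLF) line ending.
tmp=$(mktemp)
printf 'genre: graphic design\r\n' > "$tmp"

# cat -v prints the carriage return visibly as ^M.
visible=$(cat -v "$tmp")
printf '%s\n' "$visible"

# Counting CR bytes is another way to confirm they are present.
cr_count=$(tr -cd '\r' < "$tmp" | wc -c | tr -d ' ')
printf 'carriage returns: %s\n' "$cr_count"

rm -f "$tmp"
```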

Ed.

Lurker

2006-01-10, 3:58 am

Thanks, Ed. Absolutely right. Didn't recognise the symptoms.

cat -v revealed DOS line breaks, which show up as ^M.
Acquired dos2unix (from Fink on Mac OS X).
Changed all the files to remove the dodgy DOS line breaks.

All fixed!

Turns out the editor of these files had copied some of the text from
Microsoft Word into a text editor (although on a Mac). This spawn of
Satan program had managed to corrupt the text files with its
non-conformant line breaks.

Also using sort -u now, thanks for the pointer.
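
Where dos2unix is not at hand, the carriage returns could also be dropped inside awk itself. This is a minimal sketch of that alternative, not the fix used in this thread. The sample file and the simplified counting rule are invented for illustration and assume the same colon-separated layout:

```shell
tmp=$(mktemp)
# Two invented records: one with a DOS line ending, one with a Unix one.
printf 'genre: illustration\r\n' > "$tmp"
printf 'genre: illustration\n' >> "$tmp"

counts=$(awk -F: '
    { sub(/\r$/, "") }              # drop a trailing carriage return, if any
    $1 ~ /genre/ {
        value = $2
        gsub(/^ +| +$/, "", value)  # trim surrounding spaces
        count[value]++
    }
    END { for (v in count) print v, "[" count[v] "]" }
' "$tmp")
printf '%s\n' "$counts"

rm -f "$tmp"
```

With the carriage return removed before the fields are examined, both records land under the same key instead of being counted separately.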

Copyright 2008 codecomments.com