Author command-line vs. script file
Sebastian Luque

2005-05-14, 3:57 pm

Hello,

I'm cleaning some very large files of the form (these are csv files):

"Date","Time","Moisture","Grade","Temperature",
17/12/2001,20:15:15,80,20,2,
....

and I'm using awk to get rid of empty lines, lines that don't start with
the expected characters, and of lines that don't have the same number of
fields as expected from the header row. I've been able to write a script
for getting rid of the empty lines, but not for the latter goal:

,-----[ setupFiles.awk (lines: 1 - 13) ]
| #! /usr/bin/awk -f
| # AWK script for cleaning files
|
| # Change this to reflect the number of fields expected in the output
| # Set this equal to the number of channels output by the instrument
| BEGIN { FS = "," }
|
| # First we get rid of empty lines or lines that begin with anything other
| # than digits or with a double quote (header row).
| /^[0-9\"]/
|
| # We get rid of lines that contain wrong number of fields
| # I don't know how to do this so far!
`-----

The funny thing is I can do the next step by running:

$ setupFiles.awk myfile > pre-result
$ awk -F, 'NF == 5' pre-result > result

How in the world is one supposed to include that last command in the
script setupFiles.awk ? Another question is how to have the script read
the number of fields in the first row and then use that value to choose
only those rows that have exactly that number of fields. Any ideas would
be most welcome. Thanks in advance.

Cheers,
Sebastian
--
Sebastian P. Luque
Janis Papanagnou

2005-05-14, 3:57 pm

Sebastian Luque wrote:
> Hello,
>
> I'm cleaning some very large files of the form (these are csv files):
>
> "Date","Time","Moisture","Grade","Temperature",
> 17/12/2001,20:15:15,80,20,2,
> ...
>
> and I'm using awk to get rid of empty lines, lines that don't start with
> the expected characters, and of lines that don't have the same number of
> fields as expected from the header row. I've been able to write a script
> for getting rid of the empty lines, but not for the latter goal:
>
> ,-----[ setupFiles.awk (lines: 1 - 13) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning files
> |
> | # Change this to reflect the number of fields expected in the output
> | # Set this equal to the number of channels output by the instrument
> | BEGIN { FS = "," }
> |
> | # First we get rid of empty lines or lines that begin with anything other
> | # than digits or with single-quote (header row).
> | /^[0-9\"]/
> |
> | # We get rid of lines that contain wrong number of fields
> | # I don't know how to do this so far!
> `-----
>
> The funny thing is I can do the next step by running:
>
> $ setupFiles.awk myfile > pre-result
> $ awk -F, 'NF == 5' pre-result > result
>
> How in the world is one supposed to include that last command in the
> script setupFiles.awk ? Another question is how to have the script read
> the number of fields in the first row and then use that value to choose
> only those rows that have exactly that number of fields. Any ideas would
> be most welcome. Thanks in advance.


! nf { nf = NF } nf == NF
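
A minimal sketch of how that one-liner behaves, assuming a comma field separator and a few made-up rows. The first rule stores the field count of the first record while `nf` is still unset; the second prints only records whose field count matches it.

```shell
# Made-up sample rows; the third one is missing two fields.
result=$(printf '%s\n' \
  '"Date","Time","Moisture","Grade","Temperature",' \
  '17/12/2001,20:15:15,80,20,2,' \
  '18/12/2001,20:15:20,81,' \
  '19/12/2001,20:15:25,82,21,3,' |
  awk -F, '!nf { nf = NF } nf == NF')
printf '%s\n' "$result"
```

Note that with the trailing comma each complete row counts as six fields, not five, which is one reason to read the count from the header rather than hard-code it.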


Janis
Ed Morton

2005-05-14, 3:57 pm



Sebastian Luque wrote:
> Hello,
>
> I'm cleaning some very large files of the form (these are csv files):
>
> "Date","Time","Moisture","Grade","Temperature",
> 17/12/2001,20:15:15,80,20,2,
> ...
>
> and I'm using awk to get rid of empty lines, lines that don't start with
> the expected characters, and of lines that don't have the same number of
> fields as expected from the header row. I've been able to write a script
> for getting rid of the empty lines, but not for the latter goal:
>
> ,-----[ setupFiles.awk (lines: 1 - 13) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning files
> |
> | # Change this to reflect the number of fields expected in the output
> | # Set this equal to the number of channels output by the instrument
> | BEGIN { FS = "," }
> |
> | # First we get rid of empty lines or lines that begin with anything other
> | # than digits or with single-quote (header row).
> | /^[0-9\"]/


That will FIND lines that begin with those chars and so print them.
That's very different from discarding the ones that don't as you say you
want to do. To do that you'd want:

!/^[0-9\"]/ {next}

> | # We get rid of lines that contain wrong number of fields
> | # I don't know how to do this so far!
> `-----


NF != 5 {next}

> The funny thing is I can do the next step by running:
>
> $ setupFiles.awk myfile > pre-result
> $ awk -F, 'NF == 5' pre-result > result


Again, that FINDS lines with NF equal to 5 which is very different from
discarding the ones that don't, which would be:

awk -F, 'NF != 5{next}1' pre-result > result

where that's a number "one" at the end to specify a true condition which
invokes the default behavior of printing $0.
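
For comparison, a small sketch with made-up input (no trailing commas, so good rows have exactly five fields). Run on their own, the two spellings print the same lines; what `next` adds is that a rejected record never reaches any later rule in the script.

```shell
# Invented input: two rows with 5 fields, one with 3.
input='a,b,c,d,e
1,2,3
6,7,8,9,10'

# Select the rows to keep.
selected=$(printf '%s\n' "$input" | awk -F, 'NF == 5')

# Discard the rows to drop, then print whatever is left.
filtered=$(printf '%s\n' "$input" | awk -F, 'NF != 5 { next } 1')

printf '%s\n' "$filtered"
```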

> How in the world is one supposed to include that last command in the
> script setupFiles.awk ?


You can either merge the lines I gave above (or keep them separate):

BEGIN { FS = "," }
(!/^[0-9\"]/) || (NF != 5) {next}
1

The parens in the conditions may not be necessary but help clarify.
Alternatively, rather than discarding the lines you don't want (beware
double negatives!), you could do this to find the ones you do:

BEGIN { FS = "," }
/^[0-9\"]/ && (NF == 5)

> Another question is how to have the script read
> the number of fields in the first row and then use that value to choose
> only those rows that have exactly that number of fields.


BEGIN { FS = "," }
NR == 1 { nf = NF }
/^[0-9\"]/ && (NF == nf)

> Any ideas would
> be most welcome. Thanks in advance.


Regards,

Ed.
Chris F.A. Johnson

2005-05-14, 3:57 pm

On Fri, 13 May 2005 at 21:36 GMT, Sebastian Luque wrote:
> Hello,
>
> I'm cleaning some very large files of the form (these are csv files):
>
> "Date","Time","Moisture","Grade","Temperature",
> 17/12/2001,20:15:15,80,20,2,
> ...
>
> and I'm using awk to get rid of empty lines, lines that don't start with
> the expected characters, and of lines that don't have the same number of
> fields as expected from the header row. I've been able to write a script
> for getting rid of the empty lines, but not for the latter goal:
>
> ,-----[ setupFiles.awk (lines: 1 - 13) ]
>| #! /usr/bin/awk -f
>| # AWK script for cleaning files
>|
>| # Change this to reflect the number of fields expected in the output
>| # Set this equal to the number of channels output by the instrument
>| BEGIN { FS = "," }
>|
>| # First we get rid of empty lines or lines that begin with anything other
>| # than digits or with single-quote (header row).
>| /^[0-9\"]/
>|
>| # We get rid of lines that contain wrong number of fields
>| # I don't know how to do this so far!
> `-----
>
> The funny thing is I can do the next step by running:
>
> $ setupFiles.awk myfile > pre-result
> $ awk -F, 'NF == 5' pre-result > result
>
> How in the world is one supposed to include that last command in the
> script setupFiles.awk ?


Just put it in the file:

NF == 5 { print }

> Another question is how to have the script read the number of fields
> in the first row and then use that value to choose only those rows
> that have exactly that number of fields.


NR == 1 { flds = NF }
/^[0-9\"]/ && NF == flds { print }

--
Chris F.A. Johnson <http://cfaj.freeshell.org>
==================================================================
Shell Scripting Recipes: A Problem-Solution Approach, 2005, Apress
<http://www.torfree.net/~chris/books/ssr.html>
Sebastian Luque

2005-05-14, 3:57 pm

Brilliant, problem solved!, but most importantly you've taught me
tons of new stuff, thank you all for your prompt replies!

Sebastian
--
Sebastian P. Luque
Ed Morton

2005-05-14, 3:57 pm



Chris F.A. Johnson wrote:

> On Fri, 13 May 2005 at 21:36 GMT, Sebastian Luque wrote:
>
>
>
> Just put it in the file:
>
> NF == 5 { print }


That could be reduced to just:

NF == 5

but it'd still be wrong ;-) . If he added that to his file, it'd print
every line that starts with an applicable character and has the right
number of fields twice, etc...
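
The double printing is easy to reproduce. A sketch with two invented rows: every pattern-only rule prints the record it matches, so a record matching two such rules comes out twice.

```shell
# Two pattern-only rules: a record matching both is printed once per rule.
twice=$(printf '%s\n' '1,2,3,4,5' 'x,y' |
  awk -F, '/^[0-9"]/
           NF == 5')

# One combined pattern prints each matching record once.
once=$(printf '%s\n' '1,2,3,4,5' 'x,y' |
  awk -F, '/^[0-9"]/ && NF == 5')

printf 'two rules:\n%s\none rule:\n%s\n' "$twice" "$once"
```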

>
>
>
> NR == 1 { flds = NF }
> /^[0-9\"]/ && NF == flds { print }
>


That'd do it. Still no need for the "{ print }" though.

Ed.
Sebastian Luque

2005-05-14, 3:57 pm

Hi again,

With your help my script now looks like this:


,-----[ setupFiles.awk (lines: 1 - 13) ]
| #! /usr/bin/awk -f
| # AWK script for cleaning TDR files
|
| # Change this to reflect the number of fields expected in the output
| # Set this equal to the number of channels output by the instrument
| BEGIN { FS = "," }
|
| # Get only lines with the proper number of fields
| NR == 1 { flds = NF }
|
| # Pick only lines that begin with a digit or the double quote
| /^[0-9\"]/ && NF == flds
`-----

and I've been trying to add an action to the records selected in the last
statement, so that it modifies fields in some files. These unusual files
look like this:

"Date","Time","Moisture","Grade","Temperature",
17/12/2001,20:15:15,Moist=80,Grad=20,Temp=2,
....

so I modified the last line to get rid of "Moist=", "Grad=", and "Temp=":

/^[0-9\"]/ && NF == flds { sub(/Moist=|Grad=|Temp=/, ""); print }

that resulted in every row, except the header row, being eliminated. I
thought that by omitting the last argument in sub(regexp, replacement [,
target]), this would apply to the whole record. So I don't understand what
I'm doing wrong here.

Thanks a lot so far,
--
Sebastian P. Luque
Bill Seivert

2005-05-14, 3:57 pm



Sebastian Luque wrote:
> Hi again,
>
> With your help my script now looks like this:
>
>
> ,-----[ setupFiles.awk (lines: 1 - 13) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning TDR files
> |
> | # Change this to reflect the number of fields expected in the output
> | # Set this equal to the number of channels output by the instrument
> | BEGIN { FS = "," }
> |
> | # Get only lines with the proper number of fields
> | NR == 1 { flds = NF }
> |
> | # Pick only lines that begin with a digit or the single quote
> | /^[0-9\"]/ && NF == flds
> `-----
>
> and I've been trying to add an action to the records selected in the last
> statement, so that it modifies fields in some files. These unusual files
> look like this:
>
> "Date","Time","Moisture","Grade","Temperature",
> 17/12/2001,20:15:15,Moist=80,Grad=20,Temp=2,
> ...
>
> so I modified the last line to get rid of "Moist=", "Grad=", and "Temp=":
>
> /^[0-9\"]/ && NF == flds { sub(/Moist=|Grad=|Temp=/, ""); print }
>
> that resulted in every row, except the header row, being eliminated. I
> thought that by omitting the last argument in sub(regexp, replacement [,
> target]), this would apply to the whole record. So I don't understand what
> I'm doing wrong here.
>
> Thanks a lot so far,


If you change NR == 1 line to
NR == 1 { flds=NF; print}

then you can use
/^[0-9]/ && NF == flds { gsub (/,[^,]*=/, ","); print }

eliminating the \" from the pattern. Note that you want to use gsub,
instead of sub, otherwise only the first field will be changed.
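
A quick sketch of the difference, using the sample row from the question: sub() stops after the first match in the record, gsub() replaces every match.

```shell
row='17/12/2001,20:15:15,Moist=80,Grad=20,Temp=2,'

# sub() replaces only the first match in the record.
first_only=$(printf '%s\n' "$row" | awk '{ sub(/,[^,]*=/, ","); print }')

# gsub() replaces every match.
all=$(printf '%s\n' "$row" | awk '{ gsub(/,[^,]*=/, ","); print }')

printf '%s\n%s\n' "$first_only" "$all"
```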

Bill Seivert

Sebastian Luque

2005-05-14, 3:57 pm

Hi Bill,


Bill Seivert <seivert@pcisys.net> wrote:

[...]

> If you change NR == 1 line to
> NR == 1 { flds=NF; print}


I thought the print action was not needed as it was already implied, but
this forced me to wire the pattern-action structure of awk into my head.


> then you can use
> /^[0-9]/ && NF == flds { gsub (/,[^,]*=/, ","); print }
>
> eliminating the \" from the pattern. Note that you want to use gsub,
> instead of sub, otherwise only the first field will be changed.


I tried that but it's just printing the header row and nothing else. I
guess my problem is the regexp; I need to get rid of all the letters and
the equal sign in every row except the first. I also tried this, instead
of your line above:

/^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }

and got the same result. This is proving a good test of my regexp skills!

Thanks,
--
Sebastian P. Luque
Ed Morton

2005-05-14, 7:11 pm



Sebastian Luque wrote:
> Hi Bill,
>
>
> Bill Seivert <seivert@pcisys.net> wrote:
>
> [...]
>
>
>
>
> I thought the print action was not needed as it was already implied,


print $0

is the DEFAULT action. If you specify any other action, then that's what
awk does instead of the default action, so if you want to do several
things plus print $0, you need to do that explicitly.
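
A short sketch of that rule with invented two-line input: the first rule has an action, so nothing is printed for it; the second has none, so the record is printed.

```shell
out=$(printf '%s\n' 'a,b' 'c,d' |
  awk -F, 'NR == 1 { n = NF }            # action given: no implicit print
           NR == 2                        # no action: the record is printed
           END { print "fields:", n }')
printf '%s\n' "$out"
```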

> but
> this forced me to wire the pattern-action structure of awk into my head.
>
>
>
>
>
> I tried that but it's just printing the header row and nothing else. I
> guess my problem is the regexp; I need to get rid of all the letters and
> the equal sign in every row except the first. I also tried this, instead
> of your line above:
>
> /^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }


Instead of [Aa-Zz], you want [a-zA-Z] (the range "a-Z" runs backwards in ASCII, so it isn't a valid range). Alternatively use a character class:

/^[0-9]/ && NF == flds { gsub (/[[:alpha:]]*=/, ""); print }
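
A sketch of the label stripping on the sample row, written with the explicit ranges because some older awks do not support the [[:alpha:]] class; in the C locale the two are equivalent. The month name survives because it is not followed by "=".

```shell
row='Jan-08-2001,20:15:15,Moist=80,Grad=20,Temp=2,'

# Remove any run of letters that is immediately followed by "=".
cleaned=$(printf '%s\n' "$row" | awk '{ gsub(/[A-Za-z]*=/, ""); print }')

printf '%s\n' "$cleaned"
```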

Regards,

Ed.

>
> and got the same result. This is proving a good test of my regexp skills!
>
> Thanks,

Sebastian Luque

2005-05-14, 7:21 pm

Sebastian Luque <sluque@mun.ca> wrote:

[...]

> I tried that but it's just printing the header row and nothing else. I
> guess my problem is the regexp; I need to get rid of all the letters and
> the equal sign in every row except the first. I also tried this, instead
> of your line above:
>
> /^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }
>
> and got the same result.


I'm sorry, it was totally my fault, your regexp wasn't working because the
first field in those unusual files is also different, so /^[0-9]/ wasn't
working! In fact, the first field in those files was not accurately
described in my example. This is a more accurate example:

"Date","Time","Moisture","Grade","Temperature",
Jan-08-2001,20:15:15,Moist=80,Grad=20,Temp=2,
....

So I need to fix the dates before I do anything else. I need to modify
that field so the date reads like "25/01/2001". I'm halfway there:

,-----[ setupFiles.awk (lines: 11 - 26) ]
| # Change month names to numeric form
| $1 ~ /^[Aa-Zz]/ {
| sub(/Jan-/, "01/", $1);
| sub(/Feb-/, "02/", $1);
| sub(/Mar-/, "03/", $1);
| sub(/Apr-/, "04/", $1);
| sub(/May-/, "05/", $1);
| sub(/Jun-/, "06/", $1);
| sub(/Jul-/, "07/", $1);
| sub(/Aug-/, "08/", $1);
| sub(/Sep-/, "09/", $1);
| sub(/Oct-/, "10/", $1);
| sub(/Nov-/, "11/", $1);
| sub(/Dec-/, "12/", $1);
| print }
`-----

and I haven't yet been able to switch the day and month around. It doesn't
look like awk allows the use of \D to allow something like this to work:

sub(/Jan-(..)-/, "\1/01/", $1)
^^

What's the awk alternative for \D?

Thanks,
--
Sebastian P. Luque
Sebastian Luque

2005-05-14, 8:55 pm

Sebastian Luque <sluque@mun.ca> wrote:

[...]

> ,-----[ setupFiles.awk (lines: 11 - 26) ]
> | # Change month names to numeric form
> | $1 ~ /^[Aa-Zz]/ {
> | sub(/Jan-/, "01/", $1);
> | sub(/Feb-/, "02/", $1);
> | sub(/Mar-/, "03/", $1);
> | sub(/Apr-/, "04/", $1);
> | sub(/May-/, "05/", $1);
> | sub(/Jun-/, "06/", $1);
> | sub(/Jul-/, "07/", $1);
> | sub(/Aug-/, "08/", $1);
> | sub(/Sep-/, "09/", $1);
> | sub(/Oct-/, "10/", $1);
> | sub(/Nov-/, "11/", $1);
> | sub(/Dec-/, "12/", $1);
> | print }
> `-----


As a side note, this code is ignoring my BEGIN { FS = "," } which I
defined at the start of my script. So all rows except the first are
printed with space instead of comma as the field separator.

--
Sebastian P. Luque
Ed Morton

2005-05-15, 3:55 am



Sebastian Luque wrote:
> Sebastian Luque <sluque@mun.ca> wrote:
>
> [...]
>
>

You don't need those semicolons.
>
> As a side note, this code is ignoring my BEGIN { FS = "," } which I
> defined at the start of my script. So all rows except the first are
> printed with space instead of comma as the field separator.
>


FS contains the input field separator. The default output field
separator is still a space. You can control that by setting OFS, e.g.:

BEGIN{FS=OFS=","}
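
A minimal sketch of the effect, with a made-up three-field row. Changing a field (by assignment, or by a sub() that matches) makes awk rebuild the record, joining the fields with OFS.

```shell
row='Jan-08-2001,20:15:15,80'

# Only FS set: the rebuilt record uses the default OFS, a single space.
fs_only=$(printf '%s\n' "$row" | awk 'BEGIN { FS = "," } { $1 = "X"; print }')

# FS and OFS both set: the rebuilt record keeps its commas.
both=$(printf '%s\n' "$row" | awk 'BEGIN { FS = OFS = "," } { $1 = "X"; print }')

printf '%s\n%s\n' "$fs_only" "$both"
```

This also explains why the header row kept its commas: no field was modified on that line, so the record was printed as read.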

wrt your other posting:

> and I haven't yet been able to switch the day and month around. It doesn't
> look like awk allows the use of \D to allow something like this to work:
>
> sub(/Jan-(..)-/, "\1/01/", $1)
> ^^
>
> What's the awk alternative for \D?


In sub() and gsub() it's just "&" for the whole pattern matched. In gawk
you also have gensub() which uses \\1, etc., e.g.:

$1 = gensub(/Jan-(..)-/, "\\1/01/", 1, $1)
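
Since gensub() is a gawk extension, here is a portable sketch of the same swap, assuming the Mon-DD-YYYY layout from the example and using only split(). The first command just shows what "&" does in sub().

```shell
# "&" in the replacement stands for the whole matched text.
marked=$(printf '%s\n' 'Jan-08-2001' | awk '{ sub(/Jan/, "<&>"); print }')

# Portable day/month swap without backreferences.
swapped=$(printf '%s\n' 'Jan-08-2001,20:15:15,80' |
  awk 'BEGIN { FS = OFS = "," }
       {
         n = split($1, part, "-")     # part[1]="Jan", part[2]="08", part[3]="2001"
         if (n == 3 && part[1] == "Jan")
           $1 = part[2] "/01/" part[3]
         print
       }')

printf '%s\n%s\n' "$marked" "$swapped"
```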

Regards,

Ed.
Ed Morton

2005-05-15, 3:55 am



Sebastian Luque wrote:

<snip>
> ,-----[ setupFiles.awk (lines: 11 - 26) ]
> | # Change month names to numeric form
> | $1 ~ /^[Aa-Zz]/ {
> | sub(/Jan-/, "01/", $1);
> | sub(/Feb-/, "02/", $1);
> | sub(/Mar-/, "03/", $1);
> | sub(/Apr-/, "04/", $1);
> | sub(/May-/, "05/", $1);
> | sub(/Jun-/, "06/", $1);
> | sub(/Jul-/, "07/", $1);
> | sub(/Aug-/, "08/", $1);
> | sub(/Sep-/, "09/", $1);
> | sub(/Oct-/, "10/", $1);
> | sub(/Nov-/, "11/", $1);
> | sub(/Dec-/, "12/", $1);
> | print }
> `-----



Consider doing this instead of repeating the sub() for each month:

awk 'BEGIN{
mthNames="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
split(mthNames,tmp)
for (i in tmp) mths[tmp[i]] = sprintf("%02d",i)
}
{ for (mth in mths) sub(mth, mths[mth], $1) }
1'
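
A variation on the same table idea, as a sketch: instead of trying all twelve substitutions on every line, look the first three characters up directly. The array names `name` and `num` and the sample rows are made up for illustration.

```shell
converted=$(printf '%s\n' 'Mar-09-2001,20:15:15' 'Dec-25-2001,08:00:00' |
  awk 'BEGIN {
         FS = OFS = ","
         n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
         for (i = 1; i <= n; i++) num[name[i]] = sprintf("%02d", i)
       }
       {
         mon = substr($1, 1, 3)
         if (mon in num) sub(/^.../, num[mon], $1)   # replace the month name only
         print
       }')
printf '%s\n' "$converted"
```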

Regards,

Ed.
Kenny McCormack

2005-05-15, 3:55 am

In article <dbGdnZyrFfb0NBvfRVn-gg@comcast.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
....
>Consider doing this instead of repeating the sub() for each month:;
>
>awk 'BEGIN{
> mthNames="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
> split(mthNames,tmp)
> for (i in tmp) mths[tmp[i]] = sprintf("%02d",i)
>}
>{ for (mth in mths) sub(mth, mths[mth], $1) }
>1'


How about :

match("JanFebMarAprMayJunJulAugSepOctNovDec",substr($1,1,3)) {
sub(/.../,sprintf("%02d",(RSTART+2)/3),$1)
}
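
To see why the arithmetic works, a sketch that prints the computed number for a few month names. match() sets RSTART to the position where the name starts in the long string, and each name is three characters wide.

```shell
months=$(printf '%s\n' Jan Feb Jun Dec |
  awk '{
         # RSTART is the 1-based offset of the match: Jan=1, Feb=4, ..., Dec=34.
         if (match("JanFebMarAprMayJunJulAugSepOctNovDec", $1))
           printf "%s %02d\n", $1, (RSTART + 2) / 3
       }')
printf '%s\n' "$months"
```

One caveat: a three-letter string that straddles two names, such as "anF", also matches, so this relies on the first field really starting with a month abbreviation.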

Sebastian Luque

2005-05-15, 8:55 am

gazelle@yin.interaccess.com (Kenny McCormack) wrote:

[...]

> How about :
>
> match("JanFebMarAprMayJunJulAugSepOctNovDec",substr($1,1,3)) {
> sub(/.../,sprintf("%02d",(RSTART+2)/3),$1)
> }


Very nice indeed! Thanks to all for your help, my script is now working
very well. Here's how it looks:

,-----[ setupFiles.awk (lines: 1 - 27) ]
| #! /usr/bin/awk -f
| # AWK script for cleaning files
|
| BEGIN {
| FS = OFS = ",";
| }
|
| # Fix date
| match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1, 1, 3)) {
| sub(/.../, sprintf("%02d", (RSTART+2)/3), $1);
| day = substr($1, 4, 2); month = substr($1, 1, 2);
| $1 = (day "/" month "/2001")
| }
|
| # Prepare to get only lines with the proper number of fields
| NR == 1 {
| flds = NF;
| gsub(/ /, "");
| print
| }
|
| # Pick only lines that begin with a digit and have all expected fields
| /^[0-9]/ && NF == flds {
| gsub(/[[:alpha:]]*=| /, "");
| print
| }
`-----

Sorry Ed ;-) I found out I have to keep the semicolons there so Emacs'
awk-mode indents correctly (what are people using for editing awk
programs?).

Last thing I need to have this script do is pick up its input files from a
directory, and then output files named automatically. Here's my failing
attempt to do this:

BEGIN {
FS = OFS = ",";
outfile = substr(FILENAME, 1, length(FILENAME) - 4) "-b.csv"
}

and then have all my print actions in the script above as "print >
outfile". I called the program as 'setupFiles.awk infile > /dev/null', but
it creates a file named "-b.csv", so it seems to be completely ignoring my
substr function. Does anybody know some way to do this?

Thanks again!
Sebastian
--
Sebastian P. Luque
Ed Morton

2005-05-15, 3:55 pm



Sebastian Luque wrote:
> gazelle@yin.interaccess.com (Kenny McCormack) wrote:
>
> [...]
>
>
>
>
> Very nice indeed! Thanks to all for your help, my script is now working
> very well. Here's how it looks:
>
> ,-----[ setupFiles.awk (lines: 1 - 27) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning files
> |
> | BEGIN {
> | FS = OFS = ",";
> | }
> |
> | # Fix date
> | match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1, 1, 3)) {
> | sub(/.../, sprintf("%02d", (RSTART+2)/3), $1);
> | day = substr($1, 4, 2); month = substr($1, 1, 2);
> | $1 = (day "/" month "/2001")
> | }
> |
> | # Prepare to get only lines with the proper number of fields
> | NR == 1 {
> | flds = NF;
> | gsub(/ /, "");
> | print
> | }
> |
> | # Pick only lines that begin with a digit and have all expected fields
> | /^[0-9]/ && NF == flds {
> | gsub(/[[:alpha:]]*=| /, "");
> | print
> | }
> `-----
>
> Sorry Ed ;-) I found out I have to keep the semicolons there so Emacs'
> awk-mode indents correctly (what are people using for editing awk
> programs?).


vi/vim. There was a recent thread about awk editors you could search for.

> Last thing I need to have this script do is pick up its input files from a
> directory, and then output files named automatically. Here's my failing
> attempt to do this:
>
> BEGIN {
> FS = OFS = ",";
> outfile = substr(FILENAME, 1, length(FILENAME) - 4) "-b.csv"


I wouldn't deliberately create a file name with "-"s in it, or any other
character except letters, digits, periods, or underscores. That can
lead to requiring complications in scripts if you ever want to do
anything with that file later. I'd use an underscore instead.

> }
>
> and then have all my print actions in the script above as "print >
> outfile". I called the program as 'setupFiles.awk infile > /dev/null', but
> it creates a file named "-b.csv", so it seems to be completely ignoring my
> substr function. Does anybody know some way to do this?


FILENAME is not set in the BEGIN section since awk isn't reading any
file at that time. Either do that in the NR==1 part of the body, or use
ARGV[1] in the BEGIN section.
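
Both options can be sketched like this. The helper `outname` and the temporary sample file are assumptions made for the illustration; the suffix uses an underscore as suggested above.

```shell
dir=$(mktemp -d)
printf 'a,b\n' > "$dir/sample.csv"

names=$(awk '
  # Hypothetical helper: strip the 4-character extension and add the new suffix.
  function outname(f) { return substr(f, 1, length(f) - 4) "_b.csv" }

  BEGIN    { print "from ARGV[1]:  " outname(ARGV[1]) }    # no file opened yet
  FNR == 1 { print "from FILENAME: " outname(FILENAME) }   # set once reading starts
' "$dir/sample.csv")

printf '%s\n' "$names"
rm -rf "$dir"
```

The FNR == 1 form has the advantage that the name is recomputed for every input file when several are given on the command line.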

Ed.
Chris F.A. Johnson

2005-05-15, 3:55 pm

On Sun, 15 May 2005 at 13:21 GMT, Ed Morton wrote:
>
>
> I wouldn't deliberately create a file name with "-"s in it, or any other
> character except letters, digits, periods, or underscores. That can
> lead to requiring complications in scripts if you ever want to do
> anything with that file later. I'd use an underscore instead.


There's nothing wrong with "-" in a filename, except at the
beginning. The POSIX portable filename standard allows letters,
numbers, periods, hyphens and underscores, but a name may not
begin with a hyphen.

--
Chris F.A. Johnson <http://cfaj.freeshell.org>
==================================================================
Shell Scripting Recipes: A Problem-Solution Approach, 2005, Apress
<http://www.torfree.net/~chris/books/ssr.html>
Kenny McCormack

2005-05-15, 3:55 pm

In article <sa7kl2-9o2.ln1@rogers.com>,
Chris F.A. Johnson <cfajohnson@gmail.com> wrote:
....
> There's nothing wrong with "-" in a filename, except at the
> beginning. The POSIX portable filename standard allows letters,
> numbers, periods, hyphens and underscores, but a name may not
> begin with a hyphen.


That, of course, is not at all the point. It *might* be the point if this
were comp.unix.shell - but it, amazingly enough, is not.

The point *is* that embedding any strange characters (I think Ed put it
very well, BTW) can (I don't say will, but can) cause problems down the
road. And you'd be surprised, once you get used to doing this stuff on
a regular basis, how interconnected this stuff can be - how your choice of
a variablename or filename on, say, a Unix platform, might someday cause
a problem for somebody working on another part of the project on, say, an
MS platform.

That said, it is also true that spaces and other goofy characters in
filenames are a part of modern life and one probably has to buck up and deal
with it. This debate comes up periodically in the shell groups and it goes
back and forth between "Don't cause unnecessary problems" and "But be ready
and able to deal with the unnecessary problems created by other people".

Sebastian Luque

2005-05-15, 8:55 pm

Ed Morton <morton@lsupcaemnt.com> wrote:

[...]

> I wouldn't deliberately create a file name with "-"s in it, or any other
> character except letters, digits, periods, or underscores. That can
> lead to requiring complications in scripts if you ever want to do
> anything with that file later. I'd use an underscore instead.


Yes, this script may need to be used by other people in different
systems, so I did use the underscore.

> FILENAME is not set in the BEGIN section since awk isn't reading any
> file at that time. Either do that in the NR==1 part of the body, or use
> ARGV[1] in the BEGIN section.


That's it, FILENAME hasn't been set in BEGIN. I have a hard time
understanding what has and hasn't been defined at this point. I thought
that even though no lines had been read then, things like FILENAME had.
Anyway, putting it where you suggested first (in the NR == 1 part) worked
great.

Running the script as:

setupFiles.awk input-file > /dev/null

works ok, but this:

setupFiles.awk input-file1 input-file2 [etc.]

only processes the first file, and surprisingly (to me):

ls | setupFiles.awk > /dev/null

produces only a "_b.csv" file, i.e. the append portion of my new file name
definition, which contains a single line with the full *new* file name.
Any help understanding what's going on here would be greatly appreciated.

Thank you,
--
Sebastian P. Luque
Ed Morton

2005-05-15, 8:55 pm



Sebastian Luque wrote:
<snip>
> Running the script as:
>
> setupFiles.awk input-file > /dev/null
>
> works ok, but this:
>
> setupFiles.awk input-file1 input-file2 [etc.]
>
> only processes the first file, and surprisingly (to me):


Are both files formatted the same? If you need a different format (e.g.
number of applicable fields) read on the first line of each file, you
need to use FNR==1 instead of NR==1. FNR is reset for every file, NR is
the total for all files.
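
A small sketch of the difference, using two throwaway files with different numbers of columns. The rule fires on the first line of each file because FNR restarts at 1, while NR keeps counting.

```shell
dir=$(mktemp -d)
printf 'h1,h2\n1,2\n'       > "$dir/one.csv"
printf 'h1,h2,h3\n1,2,3\n'  > "$dir/two.csv"

counts=$(cd "$dir" &&
  awk -F, 'FNR == 1 { print FILENAME, "NR=" NR, "FNR=" FNR, "fields=" NF }' one.csv two.csv)

printf '%s\n' "$counts"
rm -rf "$dir"
```

With NR == 1 instead, only the first file's header would be seen, so the field count (and any output file name set in that rule) would never be refreshed for the second file.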

> ls | setupFiles.awk > /dev/null
>
> produces only a "_b.csv" file, i.e. the append portion of my new file name
> definition, which contains a single line with the full *new* file name.
> Any help understanding what's going on here would be greatly appreciated.


Right, you're not passing it any file names, just having it read stdin.
You can pass it a list of files by changing:

ls | setupFiles.awk

to:

setupFiles.awk `ls`

but that won't work for file names that contain spaces. To handle that
you'd need something like:

ls | while read file
do
setupFiles.awk "$file"
done

but that's getting OT for this group. comp.unix.shell is the best place
for these types of UNIX questions.
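
Another way to cope with spaces, sketched under the assumption that the inputs all end in ".csv": let the shell expand a glob in a for loop, which keeps each name intact without parsing ls output. The inner awk command is only a stand-in for the real script.

```shell
dir=$(mktemp -d)
printf 'a,b\n' > "$dir/plain.csv"
printf 'c,d\n' > "$dir/with space.csv"

processed=$(
  cd "$dir" || exit 1
  for file in *.csv; do
    # Stand-in for: setupFiles.awk "$file"
    awk -F, '{ print FILENAME ": " NF " fields" }' "$file"
  done
)

printf '%s\n' "$processed"
rm -rf "$dir"
```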

Ed.

Chris Croughton

2005-05-16, 8:55 am

On Sun, 15 May 2005 08:21:42 -0500, Ed Morton
<morton@lsupcaemnt.com> wrote:

> Sebastian Luque wrote:
>
> vi/vim. There was a recent thread about awk editors you could search for.


I'm also using vim, and it also wants the semicolons to keep the
indentation correct. Since I also write C and C++ I put semicolons on
automatically, and they don't do any harm in awk.

Chris C
Sebastian Luque

2005-05-17, 8:55 pm

Hi again,

With your help my script now looks like this:


,-----[ setupFiles.awk (lines: 1 - 13) ]
| #! /usr/bin/awk -f
| # AWK script for cleaning TDR files
|
| # Change this to reflect the number of fields expected in the output
| # Set this equal to the number of channels output by the instrument
| BEGIN { FS = "," }
|
| # Get only lines with the proper number of fields
| NR == 1 { flds = NF }
|
| # Pick only lines that begin with a digit or the single quote
| /^[0-9\"]/ && NF == flds
`-----

and I've been trying to add an action to the records selected in the last
statement, so that it modifies fields in some files. These unusual files
look look like this:

"Date","Time","Moisture","Grade","Temperature",
17/12/2001,20:15:15,Moist=80,Grad=20,Temp=2,
....

so I modified the last line to get rid of "Moist=", "Grad=", and "Temp=":

/^[0-9\"]/ && NF == flds { sub(/Moist=|Grad=|Temp=/, ""); print }

that resulted in every row, except the header row, being eliminated. I
thought that by omitting the last argument in sub(regexp, replacement [,
target]), this would apply to the whole record. So I don't understand what
I'm doing wrong here.

Thanks a lot so far,
--
Sebastian P. Luque
Bill Seivert

2005-05-17, 8:55 pm



Sebastian Luque wrote:
> Hi again,
>
> With your help my script now looks like this:
>
>
> ,-----[ setupFiles.awk (lines: 1 - 13) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning TDR files
> |
> | # Change this to reflect the number of fields expected in the output
> | # Set this equal to the number of channels output by the instrument
> | BEGIN { FS = "," }
> |
> | # Get only lines with the proper number of fields
> | NR == 1 { flds = NF }
> |
> | # Pick only lines that begin with a digit or the single quote
> | /^[0-9\"]/ && NF == flds
> `-----
>
> and I've been trying to add an action to the records selected in the last
> statement, so that it modifies fields in some files. These unusual files
> look look like this:
>
> "Date","Time","Moisture","Grade","Temperature",
> 17/12/2001,20:15:15,Moist=80,Grad=20,Temp=2,
> ...
>
> so I modified the last line to get rid of "Moist=", "Grad=", and "Temp=":
>
> /^[0-9\"]/ && NF == flds { sub(/Moist=|Grad=|Temp=/, ""); print }
>
> that resulted in every row, except the header row, being eliminated. I
> thought that by omitting the last argument in sub(regexp, replacement [,
> target]), this would apply to the whole record. So I don't understand what
> I'm doing wrong here.
>
> Thanks a lot so far,


If you change NR == 1 line to
NR == 1 { flds=NF; print}

then you can use
/^[0-9]/ && NF == flds { gsub (/,[^,]*=/, ","); print }

eliminating the \" from the pattern. Note that you want to use gsub,
instead of sub, otherwise only the first field will be changed.

Bill Seivert

Sebastian Luque

2005-05-17, 8:55 pm

Hi Bill,


Bill Seivert <seivert@pcisys.net> wrote:

[...]

> If you change NR == 1 line to
> NR == 1 { flds=NF; print}


I thought the print action was not needed as it was already implied, but
this forced me to wire the pattern-action structure of awk into my head.


> then you can use
> /^[0-9]/ && NF == flds { gsub (/,[^,]*=/, ","); print }
>
> eliminating the \" from the pattern. Note that you want to use gsub,
> instead of sub, otherwise only the first field will be changed.


I tried that but it's just printing the header row and nothing else. I
guess my problem is the regexp; I need to get rid of all the letters and
the equal sign in every row except the first. I also tried this, instead
of your line above:

/^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }

and got the same result. This is proving a good test of my regexp skills!

Thanks,
--
Sebastian P. Luque
Ed Morton

2005-05-17, 8:55 pm



Sebastian Luque wrote:
> Hi Bill,
>
>
> Bill Seivert <seivert@pcisys.net> wrote:
>
> [...]
>
>
>
>
> I thought the print action was not needed as it was already implied,


print $0

is the DEFAULT action. If you specify any other action, then that's what
awk does instead of the default action, so if you want to do sever
things plus print $0, you need to do that explicitly.

but
> this forced me to wire the pattern-action structure of awk into my head.
>
>
>
>
>
> I tried that but it's just printing the header row and nothing else. I
> guess my problem is the regexp; I need to get rid of all the letters and
> the equal sign in every row except the first. I also tried this, instead
> of your line above:
>
> /^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }


Instead of [Az-Za], you want [a-zA-Z]. Alternatively use a character class:

/^[0-9]/ && NF == flds { gsub (/[[:alpha:]]*=/, ""); print }

Regards,

Ed.

>
> and got the same result. This is proving a good test of my regexp skills!
>
> Thanks,

Sebastian Luque

2005-05-17, 8:55 pm

Sebastian Luque <sluque@mun.ca> wrote:

[...]

> I tried that but it's just printing the header row and nothing else. I
> guess my problem is the regexp; I need to get rid of all the letters and
> the equal sign in every row except the first. I also tried this, instead
> of your line above:
>
> /^[0-9]/ && NF == flds { gsub (/[Aa-Zz]*=/, ""); print }
>
> and got the same result.


I'm sorry, it was totally my fault, your regexp wasn't working because the
first field in those unusual files is also different, so /^[0-9]/ wasn't
working! In fact, the first field in those files was not accurately
described in my example. This is a more accurate example:

"Date","Time","Moisture","Grade","Temperature",
Jan-08-2001,20:15:15,Moist=80,Grad=20,Temp=2,
....

So I need to fix the dates before I do anything else. I need to modify
that field so the date reads like "25/01/2001". I'm half way there:

,-----[ setupFiles.awk (lines: 11 - 26) ]
| # Change month names to numeric form
| $1 ~ /^[a-zA-Z]/ {
| sub(/Jan-/, "01/", $1);
| sub(/Feb-/, "02/", $1);
| sub(/Mar-/, "03/", $1);
| sub(/Apr-/, "04/", $1);
| sub(/May-/, "05/", $1);
| sub(/Jun-/, "06/", $1);
| sub(/Jul-/, "07/", $1);
| sub(/Aug-/, "08/", $1);
| sub(/Sep-/, "09/", $1);
| sub(/Oct-/, "10/", $1);
| sub(/Nov-/, "11/", $1);
| sub(/Dec-/, "12/", $1);
| print }
`-----

and I haven't yet been able to switch the day and month around. It doesn't
look like awk allows the use of \D to allow something like this to work:

sub(/Jan-(..)-/, "\1/01/", $1)
^^

What's the awk alternative for \D?

Thanks,
--
Sebastian P. Luque
Sebastian Luque

2005-05-17, 8:55 pm

Sebastian Luque <sluque@mun.ca> wrote:

[...]

> ,-----[ setupFiles.awk (lines: 11 - 26) ]
> | # Change month names to numeric form
> | $1 ~ /^[a-zA-Z]/ {
> | sub(/Jan-/, "01/", $1);
> | sub(/Feb-/, "02/", $1);
> | sub(/Mar-/, "03/", $1);
> | sub(/Apr-/, "04/", $1);
> | sub(/May-/, "05/", $1);
> | sub(/Jun-/, "06/", $1);
> | sub(/Jul-/, "07/", $1);
> | sub(/Aug-/, "08/", $1);
> | sub(/Sep-/, "09/", $1);
> | sub(/Oct-/, "10/", $1);
> | sub(/Nov-/, "11/", $1);
> | sub(/Dec-/, "12/", $1);
> | print }
> `-----


As a side note, this code is ignoring my BEGIN { FS = "," } which I
defined at the start of my script. So all rows except the first are
printed with space instead of comma as the field separator.

--
Sebastian P. Luque
Ed Morton

2005-05-17, 8:55 pm



Sebastian Luque wrote:
> Sebastian Luque <sluque@mun.ca> wrote:
>
> [...]
>
>

You don't need those semicolons.
>
> As a side note, this code is ignoring my BEGIN { FS = "," } which I
> defined at the start of my script. So all rows except the first are
> printed with space instead of comma as the field separator.
>


FS contains the input field separator. The default output field
separator is still a space. You can control that by setting OFS, e.g.:

BEGIN{FS=OFS=","}
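
A small sketch of the difference, assuming a made-up three-field row. Assigning to a field makes awk rebuild the record, and the rebuilt record is joined with OFS, not FS:

```shell
row='Jan-08-2001,20:15:15,80'

# FS only: the rebuilt record is joined with the default OFS, a space.
fs_only=$(printf '%s\n' "$row" |
    awk 'BEGIN { FS = "," } { $1 = "08/01/2001"; print }')

# FS and OFS: the rebuilt record keeps its commas.
fs_ofs=$(printf '%s\n' "$row" |
    awk 'BEGIN { FS = OFS = "," } { $1 = "08/01/2001"; print }')

printf '%s\n' "$fs_only" "$fs_ofs"
```

This is also why the header row was unaffected in the script above: no field of that row was modified, so its record was never rebuilt.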

wrt your other posting:

> and I haven't yet been able to switch the day and month around. It doesn't
> look like awk allows the use of \D to allow something like this to work:
>
> sub(/Jan-(..)-/, "\1/01/", $1)
> ^^
>
> What's the awk alternative for \D?


sub() and gsub() have no backreferences; "&" in the replacement just
stands for the whole pattern matched. In gawk you also have gensub(),
which uses \\1, etc., e.g.:

$1 = gensub(/Jan-(..)-/, "\\1/01/", 1, $1)
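
For awks without gensub(), one possible alternative (an assumption of this sketch, not something proposed in the thread) is to split the date on "-" and reassemble it. The first command shows what "&" does; the helper swap_day_month is a hypothetical name and expects the month to be numeric already:

```shell
# "&" in the replacement stands for the whole matched text.
amp_demo=$(printf '%s\n' 'Jan-08-2001' | awk '{ sub(/[0-9][0-9]-/, "<&>"); print }')
printf '%s\n' "$amp_demo"

# Reorder month-day-year into day/month/year without backreferences.
swap_day_month() {
    awk 'BEGIN { FS = OFS = "," }
    {
        # d[1] = month, d[2] = day, d[3] = year
        if (split($1, d, "-") == 3)
            $1 = d[2] "/" d[1] "/" d[3]
        print
    }'
}

printf '%s\n' '01-08-2001,20:15:15' | swap_day_month
```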

Regards,

Ed.
Ed Morton

2005-05-17, 8:55 pm



Sebastian Luque wrote:

<snip>
> ,-----[ setupFiles.awk (lines: 11 - 26) ]
> | # Change month names to numeric form
> | $1 ~ /^[a-zA-Z]/ {
> | sub(/Jan-/, "01/", $1);
> | sub(/Feb-/, "02/", $1);
> | sub(/Mar-/, "03/", $1);
> | sub(/Apr-/, "04/", $1);
> | sub(/May-/, "05/", $1);
> | sub(/Jun-/, "06/", $1);
> | sub(/Jul-/, "07/", $1);
> | sub(/Aug-/, "08/", $1);
> | sub(/Sep-/, "09/", $1);
> | sub(/Oct-/, "10/", $1);
> | sub(/Nov-/, "11/", $1);
> | sub(/Dec-/, "12/", $1);
> | print }
> `-----



Consider doing this instead of repeating the sub() for each month:

awk 'BEGIN{
mthNames="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
split(mthNames,tmp)
for (i in tmp) mths[tmp[i]] = sprintf("%02d",i)
}
{ for (mth in mths) sub(mth, mths[mth], $1) }
1'

Regards,

Ed.
Kenny McCormack

2005-05-18, 3:57 am

In article <dbGdnZyrFfb0NBvfRVn-gg@comcast.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
....
>Consider doing this instead of repeating the sub() for each month:
>
>awk 'BEGIN{
> mthNames="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
> split(mthNames,tmp)
> for (i in tmp) mths[tmp[i]] = sprintf("%02d",i)
>}
>{ for (mth in mths) sub(mth, mths[mth], $1) }
>1'


How about :

match("JanFebMarAprMayJunJulAugSepOctNovDec",substr($1,1,3)) {
sub(/.../,sprintf("%02d",(RSTART+2)/3),$1)
}
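
The arithmetic behind this trick can be sketched as follows. Each month name occupies three characters of the packed string, so a name starting at position RSTART is month number (RSTART+2)/3. The helper month_number is hypothetical and only prints the computed number:

```shell
month_number() {
    awk '{
        # RSTART is where the three-letter name starts in the packed string:
        # Jan -> 1, Feb -> 4, Mar -> 7, ... Dec -> 34.
        if (match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1, 1, 3)))
            printf "%02d\n", (RSTART + 2) / 3
        else
            print "no match"
    }'
}

printf '%s\n' 'Mar-08-2001' 'Dec-25-2001' '17/12/2001' | month_number
```

One caveat worth keeping in mind: the match is not anchored to month boundaries, so a misaligned three-letter string such as "anF" would also match. That does not matter for input that always starts with a real month abbreviation.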

Sebastian Luque

2005-05-18, 3:57 am

gazelle@yin.interaccess.com (Kenny McCormack) wrote:

[...]

> How about :
>
> match("JanFebMarAprMayJunJulAugSepOctNovDec",substr($1,1,3)) {
> sub(/.../,sprintf("%02d",(RSTART+2)/3),$1)
> }


Very nice indeed! Thanks to all for your help, my script is now working
very well. Here's how it looks:

,-----[ setupFiles.awk (lines: 1 - 27) ]
| #! /usr/bin/awk -f
| # AWK script for cleaning files
|
| BEGIN {
| FS = OFS = ",";
| }
|
| # Fix date
| match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1, 1, 3)) {
| sub(/.../, sprintf("%02d", (RSTART+2)/3), $1);
| day = substr($1, 4, 2); month = substr($1, 1, 2);
| $1 = (day "/" month "/2001")
| }
|
| # Prepare to get only lines with the proper number of fields
| NR == 1 {
| flds = NF;
| gsub(/ /, "");
| print
| }
|
| # Pick only lines that begin with a digit and have all expected fields
| /^[0-9]/ && NF == flds {
| gsub(/[[:alpha:]]*=| /, "");
| print
| }
`-----

Sorry Ed ;-) I found out I have to keep the semicolons there so Emacs'
awk-mode indents correctly (what are people using for editing awk
programs?).

Last thing I need to have this script do is pick up its input files from a
directory, and then output files named automatically. Here's my failing
attempt to do this:

BEGIN {
FS = OFS = ",";
outfile = substr(FILENAME, 1, length(FILENAME) - 4) "-b.csv"
}

and then have all my print actions in the script above as "print >
outfile". I called the program as 'setupFiles.awk infile > /dev/null', but
it creates a file named "-b.csv", so it seems to be completely ignoring my
substr function. Does anybody know some way to do this?

Thanks again!
Sebastian
--
Sebastian P. Luque
Ed Morton

2005-05-18, 3:57 am



Sebastian Luque wrote:
> gazelle@yin.interaccess.com (Kenny McCormack) wrote:
>
> [...]
>
>
>
>
> Very nice indeed! Thanks to all for your help, my script is now working
> very well. Here's how it looks:
>
> ,-----[ setupFiles.awk (lines: 1 - 27) ]
> | #! /usr/bin/awk -f
> | # AWK script for cleaning files
> |
> | BEGIN {
> | FS = OFS = ",";
> | }
> |
> | # Fix date
> | match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1, 1, 3)) {
> | sub(/.../, sprintf("%02d", (RSTART+2)/3), $1);
> | day = substr($1, 4, 2); month = substr($1, 1, 2);
> | $1 = (day "/" month "/2001")
> | }
> |
> | # Prepare to get only lines with the proper number of fields
> | NR == 1 {
> | flds = NF;
> | gsub(/ /, "");
> | print
> | }
> |
> | # Pick only lines that begin with a digit and have all expected fields
> | /^[0-9]/ && NF == flds {
> | gsub(/[[:alpha:]]*=| /, "");
> | print
> | }
> `-----
>
> Sorry Ed ;-) I found out I have to keep the semicolons there so Emacs'
> awk-mode indents correctly (what are people using for editing awk
> programs?).


vi/vim. There was a recent thread about awk editors you could search for.

> Last thing I need to have this script do is pick up its input files from a
> directory, and then output files named automatically. Here's my failing
> attempt to do this:
>
> BEGIN {
> FS = OFS = ",";
> outfile = substr(FILENAME, 1, length(FILENAME) - 4) "-b.csv"


I wouldn't deliberately create a file name with "-"s in it, or any other
character except letters, digits, periods, or underscores. That can
lead to requiring complications in scripts if you ever want to do
anything with that file later. I'd use an underscore instead.

> }
>
> and then have all my print actions in the script above as "print >
> outfile". I called the program as 'setupFiles.awk infile > /dev/null', but
> it creates a file named "-b.csv", so it seems to be completely ignoring my
> substr function. Does anybody know some way to do this?


FILENAME is not set in the BEGIN section since awk isn't reading any
file at that time. Either do that in the NR==1 part of the body, or use
ARGV[1] in the BEGIN section.
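
A quick way to see this for yourself, sketched with a throwaway file (the name sample.csv and the temporary directory are made up for the example). In the awks I know of, FILENAME prints as an empty string in BEGIN, while ARGV[1] already holds the first command-line argument:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'a,b' > "$tmpdir/sample.csv"

out=$(awk '
    BEGIN    { printf "in BEGIN: FILENAME=[%s] ARGV[1]=[%s]\n", FILENAME, ARGV[1] }
    FNR == 1 { printf "in body:  FILENAME=[%s]\n", FILENAME }
' "$tmpdir/sample.csv")

printf '%s\n' "$out"
```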

Ed.
Chris F.A. Johnson

2005-05-18, 3:57 am

On Sun, 15 May 2005 at 13:21 GMT, Ed Morton wrote:
>
>
> I wouldn't deliberately create a file name with "-"s in it, or any other
> character except letters, digits, periods, or underscores. That can
> lead to requiring complications in scripts if you ever want to do
> anything with that file later. I'd use an underscore instead.


There's nothing wrong with "-" in a filename, except at the
beginning. The POSIX portable filename standard allows letters,
numbers, periods, hyphens and underscores, but a name may not
begin with a hyphen.

--
Chris F.A. Johnson <http://cfaj.freeshell.org>
==================================================================
Shell Scripting Recipes: A Problem-Solution Approach, 2005, Apress
<http://www.torfree.net/~chris/books/ssr.html>
Kenny McCormack

2005-05-18, 3:57 am

In article <sa7kl2-9o2.ln1@rogers.com>,
Chris F.A. Johnson <cfajohnson@gmail.com> wrote:
....
> There's nothing wrong with "-" in a filename, except at the
> beginning. The POSIX portable filename standard allows letters,
> numbers, periods, hyphens and underscores, but a name may not
> begin with a hyphen.


That, of course, is not at all the point. It *might* be the point if this
were comp.unix.shell - but it, amazingly enough, is not.

The point *is* that embedding any strange characters (I think Ed put it
very well, BTW) can (I don't say will, but can) cause problems down the
road. And you'd be surprised, once you get used to doing this stuff on
a regular basis, how interconnected this stuff can be - how your choice of
a variablename or filename on, say, a Unix platform, might someday cause
a problem for somebody working on another part of the project on, say, an
MS platform.

That said, it is also true that spaces and other goofy characters in
filenames are a part of modern life and one probably has to buck up and deal
with it. This debate comes up periodically in the shell groups and it goes
back and forth between "Don't cause unnecessary problems" and "But be ready
and able to deal with the unnecessary problems created by other people".

Sebastian Luque

2005-05-18, 3:57 am

Ed Morton <morton@lsupcaemnt.com> wrote:

[...]

> I wouldn't deliberately create a file name with "-"s in it, or any other
> character except letters, digits, periods, or underscores. That can
> lead to requiring complications in scripts if you ever want to do
> anything with that file later. I'd use an underscore instead.


Yes, this script may need to be used by other people in different
systems, so I did use the underscore.

> FILENAME is not set in the BEGIN section since awk isn't reading any
> file at that time. Either do that in the NR==1 part of the body, or use
> ARGV[1] in the BEGIN section.


That's it, FILENAME hasn't been set in BEGIN. I have a hard time
understanding what has and hasn't been defined at this point. I thought
that even though no lines had been read then, things like FILENAME had.
Anyway, putting it where you suggested first (in the NR == 1 part) worked
great.

Running the script as:

setupFiles.awk input-file > /dev/null

works ok, but this:

setupFiles.awk input-file1 input-file2 [etc.]

only processes the first file, and surprisingly (to me):

ls | setupFiles.awk > /dev/null

produces only a "_b.csv" file, i.e. the append portion of my new file name
definition, which contains a single line with the full *new* file name.
Any help understanding what's going on here would be greatly appreciated.

Thank you,
--
Sebastian P. Luque
Ed Morton

2005-05-18, 3:57 am



Sebastian Luque wrote:
<snip>
> Running the script as:
>
> setupFiles.awk input-file > /dev/null
>
> works ok, but this:
>
> setupFiles.awk input-file1 input-file2 [etc.]
>
> only processes the first file, and surprisingly (to me):


Are both files formatted the same? If the format (e.g. the number of
applicable fields) has to be read from the first line of each file, you
need to use FNR==1 instead of NR==1. FNR is reset for every file, while
NR is the running total across all files.
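
Putting that together with the output-file naming discussed earlier, a minimal sketch could look like this. The file names one.csv and two.csv and their contents are invented for the example, and the "_b.csv" suffix follows the thread. The header of each file sets both the expected field count and the output name:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '"A","B",' '1,2,' '3,' > "$tmpdir/one.csv"
printf '%s\n' '"A","B","C",' '4,5,6,' '7,8,' > "$tmpdir/two.csv"

awk 'BEGIN { FS = OFS = "," }
FNR == 1 {
    # New input file: finish the previous output, derive the new name.
    if (outfile != "") close(outfile)
    outfile = FILENAME
    sub(/\.csv$/, "", outfile)
    outfile = outfile "_b.csv"
    flds = NF            # field count expected for this file
    print > outfile
    next
}
# Keep rows that start with a digit and have the expected field count.
/^[0-9]/ && NF == flds { print > outfile }' "$tmpdir/one.csv" "$tmpdir/two.csv"

cat "$tmpdir/one_b.csv" "$tmpdir/two_b.csv"
```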

> ls | setupFiles.awk > /dev/null
>
> produces only a "_b.csv" file, i.e. the append portion of my new file name
> definition, which contains a single line with the full *new* file name.
> Any help understanding what's going on here would be greatly appreciated.


Right, you're not passing it any file names, just having it read stdin.
You can pass it a list of files by changing:

ls | setupFiles.awk

to:

setupFiles.awk `ls`

but that won't work for file names that contain spaces. To handle that
you'd need something like:

ls | while read file
do
setupFiles.awk "$file"
done

but that's getting OT for this group. comp.unix.shell is the best place
for these types of UNIX questions.
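
For what it's worth, a shell glob avoids parsing ls output altogether and copes with spaces in names. This is only a sketch: the two sample files are made up, and a one-line awk command stands in for setupFiles.awk:

```shell
tmpdir=$(mktemp -d)
printf 'x\n' > "$tmpdir/plain.csv"
printf 'y\n' > "$tmpdir/with space.csv"

seen=0
for file in "$tmpdir"/*.csv; do
    [ -e "$file" ] || continue      # the glob matched nothing
    # Stand-in for: setupFiles.awk "$file"
    awk 'END { print FILENAME ": " NR " line(s)" }' "$file"
    seen=$((seen + 1))
done
```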

Ed.

Chris Croughton

2005-05-18, 3:57 am

On Sun, 15 May 2005 08:21:42 -0500, Ed Morton
<morton@lsupcaemnt.com> wrote:

> Sebastian Luque wrote:
>
> vi/vim. There was a recent thread about awk editors you could search for.


I'm also using vim, and it also wants the semicolons to keep the
indentation correct. Since I also write C and C++ I put semicolons on
automatically, and they don't do any harm in awk.

Chris C

Copyright 2008 codecomments.com