Re: How to combine two awk commands
Jonny

2005-04-07, 3:56 pm

I wrote:

> I am piping the output from one awk command to another awk command, and
> I was wondering if it is possible to combine them.
>
> The first command is just to print the second field of a file:
>
> awk "BEGIN {FS = \" \"} {print $2}"
>
> and the second command is to remove duplicates from the (unsorted)
> result of the first command:
>
> awk "{if (data[$0]++ == 0) lines[++count] = $0} END {for (i = 1; i
> <=count; i++) print lines[i]}"
>
> Please could you tell me if it is possible to combine them.



Thanks hq00e and Ed for your replies.

I found Ed's answer:

awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
}"

to run slightly faster than hq00e's:

awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
array[i] }"

and both of them are running about 40% to 45% quicker than my
two-command approach.

Ed wrote:

> but you'd probably be changing the order of the output compared to the
> input by using the "in" operator this way (see
> http://www.gnu.org/software/gawk/ma...anning-an-Array)
> which may not be desirable.


I don't mind, since the input file was not in any particular order
anyway.
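
For readers who do care about order, here is a minimal sketch of the difference Ed describes. The four-line sample input is made up for illustration; the first awk program is the original de-duplication command adapted to work on $2 directly, and the second is the for-in version quoted above. The for-in output is piped through sort only because its order is unspecified.

```shell
# Hypothetical sample input: two space-separated fields per line.
sample='a z
b x
c z
d y'

# Order-preserving: remember each new second field in a numbered list,
# then print that list in the order the values were first seen.
in_order=$(printf '%s\n' "$sample" | awk '
    { if (data[$2]++ == 0) lines[++count] = $2 }
    END { for (i = 1; i <= count; i++) print lines[i] }')

# for-in traversal: same set of values, but in whatever order awk
# chooses, so sort it to get something comparable.
any_order=$(printf '%s\n' "$sample" | awk '
    { array[$2]++ }
    END { for (i in array) print i }' | sort)

printf 'first-seen order:\n%s\n' "$in_order"
printf 'for-in (sorted):\n%s\n' "$any_order"
```

Both variants read the input once and need no pipe between two awk processes.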

Your help is much appreciated.

Regards,
Jonny
Ed Morton

2005-04-07, 8:55 pm



Jonny wrote:
<snip>
> I found Ed's answer:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
> }"
>
> to run slightly faster than hq00e's:
>
> awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
> array[i] }"
>
> and both of them are running about 40% to 45% quicker than my
> two-command approach.


Did you try my original proposal (from 4/6):

awk "BEGIN{FS = \" \"}!date[$2]++"

I'd expect it to be the fastest. The one above was just pointing out
potential improvements to hq00e's proposal, but I wouldn't really do it
that way.

Ed.
Jonny

2005-04-08, 8:55 am

Ed Morton wrote:

> Did you try my original proposal (from 4/6):
>
> awk "BEGIN{FS = \" \"}!date[$2]++"
>
> I'd expect it to be the fastest. The one above was just pointing out
> potential improvements to hq00e's proposal, but I wouldn't really do it
> that way.


Hmm. I don't have that posting in my list, and the postings in my
newsreader are sorted by date. Strange.

Anyway, I tried the above command. Perhaps I'm missing something, but
it prints the first field and the second field, but I just wanted the
second field to be printed.

I don't know why, but it was actually slightly slower than your:

awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
i}"

Regards,
Jonny
Ed Morton

2005-04-08, 3:56 pm



Jonny wrote:

> Ed Morton wrote:
>
>
>
>
> Hmm. I don't have that posting in my list, and the postings in my
> newsreader are sorted by date. Strange.


Happens to me using Netscape too. Google Groups has a lot of problems,
but at least it seems to catch all the postings! Having said that, I
still use Netscape and don't worry about the few I miss.

> Anyway, I tried the above command. Perhaps I'm missing something, but
> it prints the first field and the second field, but I just wanted the
> second field to be printed.


Yeah, you're right. It should've been:

awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"

I just noticed that you're setting the FS to its default value, a
single space, so you don't actually need that BEGIN section:

awk "!date[$2]++{print $2}"
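
As a rough illustration of what that one-liner does, the sketch below runs it next to a spelled-out equivalent on made-up sample data. The array name seen and the sample lines are assumptions for the example, and single quotes are used so that a Unix shell leaves the program text alone.

```shell
# Hypothetical sample input: an id and a value on each line.
sample='1 apple
2 pear
3 apple
4 fig
5 pear'

# Terse form: the pattern is true only while the counter for $2 is
# still zero; the post-increment then marks the value as seen.
terse=$(printf '%s\n' "$sample" | awk '!seen[$2]++ { print $2 }')

# Spelled-out form of the same test-then-increment logic.
verbose=$(printf '%s\n' "$sample" | awk '{
    if (seen[$2] == 0) print $2   # first occurrence: count is still zero
    seen[$2]++                    # remember that this value has appeared
}')

printf '%s\n' "$terse"
```

Unlike the END-loop version, this prints each value as soon as it is first read, so the output keeps the input order.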

> I don't know why, but it was actually slightly slower than your:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
> i}"
>


That's very surprising since it's avoiding the loop and array indexing.
When I ran both commands on a file that was 100000 lines long, I got
these results:

PS1> time gawk '!array[$2]++{print $2}'

real 0m1.67s
user 0m1.25s
sys 0m0.14s

PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'

real 0m1.28s
user 0m1.09s
sys 0m0.15s

Beats me....
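
For anyone wanting to repeat the comparison, here is a minimal sketch of a reproducible setup. The generated file (100000 lines, 1000 distinct second fields) is an assumption and not the data used for the timings above; the sketch only checks that both variants find the same number of distinct values.

```shell
# Hypothetical test data: 100000 lines, 1000 distinct values in field 2.
testfile=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print i, "key" (i % 1000) }' > "$testfile"

# Variant 1: print each second field the first time it appears.
streaming_count=$(awk '!array[$2]++ { print $2 }' "$testfile" | wc -l | tr -d ' ')

# Variant 2: collect everything, then print the keys at the end.
end_loop_count=$(awk '{ array[$2]++ } END { for (i in array) print i }' "$testfile" | wc -l | tr -d ' ')

rm -f "$testfile"
printf 'streaming: %s  end-loop: %s\n' "$streaming_count" "$end_loop_count"
```

Prefixing each awk command with the shell's time gives the kind of measurement quoted above; the ratio will depend on how many distinct values the input contains.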

Ed.
Loki Harfagr

2005-04-08, 3:56 pm

On Fri, 08 Apr 2005 08:50:47 -0500, Ed Morton wrote:

>
>
> Jonny wrote:
>
>
> Yeah, you're right. It should've been:
>
> awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"
>
> I just noticed that you're setting the FS to its default value, a
> single space, so you don't actually need that BEGIN section:
>
> awk "!date[$2]++{print $2}"
>
>
> That's very surprising since it's avoiding the loop and array indexing.
> When I ran both commands on a file that was 100000 lines long, I got
> these results:
>
> PS1> time gawk '!array[$2]++{print $2}'
>
> real 0m1.67s
> user 0m1.25s
> sys 0m0.14s
>
> PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'
>
> real 0m1.28s
> user 0m1.09s
> sys 0m0.15s
>
> Beats me....
>
> Ed.


Yes, I see the same baffling speed ordering
on a big file.
$ time awk '!date[$2]++{print $2}' testfileHUGER
real 0m0.606s
user 0m0.433s
sys 0m0.085s
$ wc testfileHUGER
360000 1536000 15024000 testfileHUGER

Another thing I *don't* understand here is
the usability of doublequotes and the bang!
and/or the $

In bash, csh, and ksh here I can't get them to work; all I get are
these errors (which is what I would have predicted):

$ awk "!date[$2]++{print $2}" testfile
bash: !date[$2]++{print: event not found

$ csh
% awk "!date[$2]++{print $2}" testfile
date[: Event not found.
%exit

$ ksh
u@h:w$ awk "!date[$2]++{print $2}" testfile
awk: !date[]++{print }
awk: ^ syntax error
awk: Fatal: sous-expression invalide


Would you explain slowly why it works in your environments? :D)
(Whether I set LC_ALL to C or fr_FR, the result is the same
on mine.)
Ed Morton

2005-04-08, 3:56 pm



Loki Harfagr wrote:
<snip>
> Another thing I *don't* understand here is
> the usability of doublequotes and the bang!
> and/or the $


"!" is the "not" operator. "$" dereferences an argument. Use of double
quotes instead of single is, I believe, a Windows thing. I never use it
and wouldn't expect it to work on any UNIX environment.
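
A minimal sketch of why the double-quoted form fails in a Unix shell: the shell expands $2 inside double quotes before awk ever sees the program, which matches the mangled program text in the ksh error quoted earlier. The function name show_quoting is made up for the example. In interactive bash and csh the "!" additionally triggers history expansion, which a script like this does not reproduce.

```shell
show_quoting() {
    # Inside a function called without arguments, $2 is empty, much as it
    # is at a typical interactive prompt.
    # Double quotes: the shell substitutes $2 before the program is used.
    printf 'double: %s\n' "!date[$2]++{print $2}"
    # Single quotes: the program text is passed through untouched.
    printf 'single: %s\n' '!date[$2]++{print $2}'
}

quoting_demo=$(show_quoting)
printf '%s\n' "$quoting_demo"
```

With single quotes, the same one-liner runs unchanged in bash, csh, and ksh.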

Ed.
Loki Harfagr

2005-04-08, 8:56 pm

On Fri, 08 Apr 2005 12:59:47 -0500, Ed Morton wrote:

>
>
> Loki Harfagr wrote:
> <snip>
>
> "!" is the "not" operator. "$" dereferences an argument.


Well, you know I know this :-)

> Use of double
> quotes instead of single is, I believe, a Windows thing.


All right! I guess you've found the trick!
I feel really sorry for having posted such a dull question
without even thinking about it :D)

> I never use it


Well, I bet I never even tried to, before these posts ...

> and wouldn't expect it to work on any UNIX environment.


I confirm it *does* **not** work

> Ed.


Thanks for the relief, Ed. I even thought I was going blind;
it might be this case of conjunctivitis I've had these days.
Whoa, Windows: it never occurred to me that this could be the point;
I should've spotted that "Noworyta" is a win32 app ...
I guess I'm going to have a better weekend now :D)
Cheers indeed.
Have a nice weekend too, Ed.

PS: as a follow-up to the previous posts about speed and
performance, I compiled the scripts to C with awka, and against all
odds the same relative differences appear.
The next step (when/if I find some time not spent writing
stoopeedeeteez on Usenet) would be to write a sharpened C version of it ...
We'll see :-=)

Copyright 2008 codecomments.com