Code Comments
I wrote:
> I am piping the output from one awk command to another awk command, and
> I was wondering if it is possible to combine them.
>
> The first command is just to print the second field of a file:
>
> awk "BEGIN {FS = \" \"} {print $2}"
>
> and the second command is to remove duplicates from the (unsorted)
> result of the first command:
>
> awk "{if (data[$0]++ == 0) lines[++count] = $0} END {for (i = 1; i
> <=count; i++) print lines[i]}"
>
> Please could you tell me if it is possible to combine them.
Thanks hq00e and Ed for your replies.
I found Ed's answer:
awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
}"
to run slightly faster than hq00e's:
awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
array[i] }"
and both of them are running about 40% to 45% quicker than my
two-command approach.
Ed wrote:
> but you'd probably be changing the order of the output compared to the
> input by using the "in" operator this way (see
> http://www.gnu.org/software/gawk/ma...anning-an-Array)
> which may not be desirable.
I don't mind, since the input file was not in any particular order
anyway.
Your help is much appreciated.
Regards,
Jonny
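For anyone who does need the original order, here is a minimal sketch of a single awk program that keeps first-seen order, next to the for-in variant. The sample data and the names (sample, ordered, unordered, seen) are made up for illustration, and single quotes are used on the assumption of a POSIX shell.

```shell
# Hypothetical sample input: the second field repeats and is not sorted.
sample='a pear
b apple
c pear
d fig
e apple'

# One program: remember each new second field in arrival order, print at END.
ordered=$(printf '%s\n' "$sample" |
  awk '{ if (seen[$2]++ == 0) lines[++count] = $2 }
       END { for (i = 1; i <= count; i++) print lines[i] }')

# for-in variant from the thread: same values, but the traversal order
# is left to the awk implementation.
unordered=$(printf '%s\n' "$sample" |
  awk '{ array[$2]++ } END { for (i in array) print i }')

printf '%s\n' "$ordered"
```

Piping the for-in output through sort is one way to get a stable order when the input order does not matter.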
Jonny wrote:
<snip>
> I found Ed's answer:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
> }"
>
> to run slightly faster than hq00e's:
>
> awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
> array[i] }"
>
> and both of them are running about 40% to 45% quicker than my
> two-command approach.
Did you try my original proposal (from 4/6):
awk "BEGIN{FS = \" \"}!date[$2]++"
I'd expect it to be the fastest. The one above was just pointing out
potential improvements to hq00e's proposal, but I wouldn't really do it
that way.
Ed.
Ed Morton wrote:
> Did you try my original proposal (from 4/6):
>
> awk "BEGIN{FS = \" \"}!date[$2]++"
>
> I'd expect it to be the fastest. The one above was just pointing out
> potential improvements to hq00e's proposal, but I wouldn't really do it
> that way.
Hmm. I don't have that posting in my list, and the postings in my
newsreader are sorted by date. Strange.
Anyway, I tried the above command. Perhaps I'm missing something, but
it prints the first field and the second field, but I just wanted the
second field to be printed.
I don't know why, but it was actually slightly slower than your:
awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
i}"
Regards,
Jonny
Jonny wrote:
> Ed Morton wrote:
>
> Hmm. I don't have that posting in my list, and the postings in my
> newsreader are sorted by date. Strange.
Happens to me using Netscape too. google groups has a lot of problems but
at least it seems to catch all the postings! Having said that, I still use
Netscape and don't worry about the few I miss.
> Anyway, I tried the above command. Perhaps I'm missing something, but
> it prints the first field and the second field, but I just wanted the
> second field to be printed.
Yeah, you're right. It should've been:
awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"
I just noticed that you're setting the FS to its default value, a
single space, so you don't actually need that BEGIN section:
awk "!date[$2]++{print $2}"
> I don't know why, but it was actually slightly slower than your:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
> i}"
>
That's very surprising since it's avoiding the loop and array indexing.
When I ran both commands on a file that was 100000 lines long, I got
these results:
PS1> time gawk '!array[$2]++{print $2}'
real 0m1.67s
user 0m1.25s
sys 0m0.14s
PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'
real 0m1.28s
user 0m1.09s
sys 0m0.15s
Beats me....
Ed.
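The difference between the pattern-only form and the corrected form can be checked on a tiny input. This is only a sketch: the sample data and the names (sample, whole, streamed, buffered, seen) are invented for illustration, and it says nothing about which form is faster.

```shell
# Hypothetical sample input with repeated second fields.
sample='1 x
2 y
3 x
4 z
5 y'

# Pattern with no action: awk prints the whole record ($0) on first sight.
whole=$(printf '%s\n' "$sample" | awk '!seen[$2]++')

# Pattern plus action: print only the second field on first sight.
streamed=$(printf '%s\n' "$sample" | awk '!seen[$2]++ { print $2 }')

# END-loop form: collect the keys, print them afterwards (order unspecified).
buffered=$(printf '%s\n' "$sample" |
  awk '{ seen[$2]++ } END { for (k in seen) print k }')

printf '%s\n' "$streamed"
```

To compare speed on a real file, either awk command can be prefixed with time, as in the post above.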
On Fri, 08 Apr 2005 08:50:47 -0500, Ed Morton wrote:
>
>
> Jonny wrote:
>
...
>
> Yeah, you're right. It should've been:
>
> awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"
>
> I just noticed that you're setting the FS to its default value, a
> single space, so you don't actually need that BEGIN section:
>
> awk "!date[$2]++{print $2}"
>
>
> That's very surprising since it's avoiding the loop and array indexing.
> When I ran both commands on a file that was 100000 lines long, I got
> these results:
>
> PS1> time gawk '!array[$2]++{print $2}'
>
> real 0m1.67s
> user 0m1.25s
> sys 0m0.14s
>
> PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'
>
> real 0m1.28s
> user 0m1.09s
> sys 0m0.15s
>
> Beats me....
>
> Ed.
Yes, I see the same baffling difference in speed
on a big file.
$ time awk '!date[$2]++{print $2}' testfileHUGER
real 0m0.606s
user 0m0.433s
sys 0m0.085s
$ wc testfileHUGER
360000 1536000 15024000 testfileHUGER
Another thing I *don't* understand here is
the usability of doublequotes and the bang!
and/or the $
In bash, csh and ksh here I can't get them to work; all I get are
these errors (as I would have bet):
$ awk "!date[$2]++{print $2}" testfile
bash: !date[$2]++{print: event not found
$ csh
% awk "!date[$2]++{print $2}" testfile
date[: Event not found.
%exit
$ ksh
u@h:w$ awk "!date[$2]++{print $2}" testfile
awk: !date[]++{print }
awk: ^ syntax error
awk: Fatal: sous-expression invalide
Would you explain slowly why it works in your environments ?~D)
(whether I put LC_ALL to C or fr_FR it's still the same
on mine)
Loki Harfagr wrote:
<snip>
> Another thing I *don't* understand here is
> the usability of doublequotes and the bang!
> and/or the $
"!" is the "not" operator. "$" dereferences an argument. Use of double
quotes instead of single is, I believe, a windows thing. I never use it
and wouldn't expect it to work on any UNIX environment.
Ed.
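The ksh error shown earlier (awk receiving `!date[]++{print }`) is consistent with the shell expanding `$2` inside double quotes before awk ever runs. A minimal sketch of that effect, assuming a POSIX shell run as a script (where bash does not apply history expansion to "!"); the function names are made up for illustration:

```shell
# Inside a function called with no arguments, $2 is empty, which mimics
# typing the command at a prompt with no positional parameters set.
double_quoted() { printf '%s\n' "!seen[$2]++{print $2}"; }
single_quoted() { printf '%s\n' '!seen[$2]++{print $2}'; }

dq=$(double_quoted)   # the shell has already replaced $2 with nothing
sq=$(single_quoted)   # the text would reach awk untouched

printf 'double: %s\nsingle: %s\n' "$dq" "$sq"
```

In an interactive bash or csh session the "!" inside double quotes is additionally treated as a history reference, which matches the "event not found" messages reported above.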
On Fri, 08 Apr 2005 12:59:47 -0500, Ed Morton wrote:
>
>
> Loki Harfagr wrote:
> <snip>
>
> "!" is the "not" operator. "$" dereferences an argument.
Well, you know I know this :-)
> Use of double
> quotes instead of single is, I believe, a windows thing.
All right! I guess you found the very trick!
I feel really sorry for having posted such a dull question
without even thinking about it :D)
> I never use it
Well, I bet I never even tried to before these posts ...
> and wouldn't expect it to work on any UNIX environment.
I confirm it *does* **not** work
> Ed.
Thanks for the relief, Ed. I even thought I was going blind; it might be
this case of conjunctivitis I've had these days.
Woah, Windows; it never occurred to me that this could've been the point.
I should've pointed out that "Noworyta" is a win32 app ...
I guess I'm gonna have a better weekend now :D)
Cheers indeed. Have a nice weekend too, Ed.
PS: as a complement to the previous posts about speed and performance,
I "awka'd" the scripts to C and the same rates of difference appear,
against all odds.
Next step (when/if I find some time not spent writing stoopeedeeteez on
Usenet) would be to write a sharpened C version of it ... We'll see :-=)