Re: How to combine two awk commands
|
|
|
| I wrote:
> I am piping the output from one awk command to another awk command, and
> I was wondering if it is possible to combine them.
>
> The first command is just to print the second field of a file:
>
> awk "BEGIN {FS = \" \"} {print $2}"
>
> and the second command is to remove duplicates from the (unsorted)
> result of the first command:
>
> awk "{if (data[$0]++ == 0) lines[++count] = $0} END {for (i = 1; i
> <=count; i++) print lines[i]}"
>
> Please could you tell me if it is possible to combine them.
Thanks hq00e and Ed for your replies.
I found Ed's answer:
awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
}"
to run slightly faster than hq00e's:
awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
array[i] }"
and both of them are running about 40% to 45% quicker than my
two-command approach.
Ed wrote:
> but you'd probably be changing the order of the output compared to the
> input by using the "in" operator this way (see
> http://www.gnu.org/software/gawk/ma...anning-an-Array)
> which may not be desirable.
I don't mind, since the input file was not in any particular order
anyway.
Your help is much appreciated.
Regards,
Jonny
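On the ordering point Ed raised: if the input order did matter, the two original commands could be folded into one while keeping first-seen order. This is only a sketch; the sample data is made up, and the array names `data` and `lines` are simply carried over from the original two-command version.

```shell
# Hypothetical sample input: an id in field 1, the value of interest in field 2.
input='1 pear
2 apple
3 pear
4 fig
5 apple'

# Store each second field the first time it appears, numbered in arrival
# order, then print the stored values in that same order at END.
ordered=$(printf '%s\n' "$input" | awk '
  { if (data[$2]++ == 0) lines[++count] = $2 }
  END { for (i = 1; i <= count; i++) print lines[i] }
')
printf '%s\n' "$ordered"
```

Unlike `for (i in array)`, the numeric loop walks the saved values in the order they were first seen.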
| |
| Ed Morton 2005-04-07, 8:55 pm |
|
Jonny wrote:
<snip>
> I found Ed's answer:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print i
> }"
>
> to run slightly faster than hq00e's:
>
> awk "BEGIN{ FS=\" \" } { array[$2]=$2 } END{ for ( i in array ) print
> array[i] }"
>
> and both of them are running about 40% to 45% quicker than my
> two-command approach.
Did you try my original proposal (from 4/6):
awk "BEGIN{FS = \" \"}!date[$2]++"
I'd expect it to be the fastest. The one above was just pointing out
potential improvements to hq00e's proposal, but I wouldn't really do it
that way.
Ed.
| |
|
| Ed Morton wrote:
> Did you try my original proposal (from 4/6):
>
> awk "BEGIN{FS = \" \"}!date[$2]++"
>
> I'd expect it to be the fastest. The one above was just pointing out
> potential improvements to hq00e's proposal, but I wouldn't really do it
> that way.
Hmm. I don't have that posting in my list, and the postings in my
newsreader are sorted by date. Strange.
Anyway, I tried the above command. Perhaps I'm missing something, but
it prints the first field and the second field, but I just wanted the
second field to be printed.
I don't know why, but it was actually slightly slower than your:
awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
i}"
Regards,
Jonny
| |
| Ed Morton 2005-04-08, 3:56 pm |
|
Jonny wrote:
> Ed Morton wrote:
>
>
>
>
> Hmm. I don't have that posting in my list, and the postings in my
> newsreader are sorted by date. Strange.
Happens to me using Netscape too. Google Groups has a lot of problems,
but at least it seems to catch all the postings! Having said that, I
still use Netscape and don't worry about the few I miss.
> Anyway, I tried the above command. Perhaps I'm missing something, but
> it prints the first field and the second field, but I just wanted the
> second field to be printed.
Yeah, you're right. It should've been:
awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"
I just noticed that you're setting the FS to its default value, a
single space, so you don't actually need that BEGIN section:
awk "!date[$2]++{print $2}"
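For readers of the archive, here is a small sketch of why the pattern-only form deduplicates. It assumes a Unix shell (hence the single quotes), invented sample input, and the array name `seen` rather than the `date` used in the thread.

```shell
# Hypothetical sample input: two space-separated fields per line.
input='a x
b y
c x
d z
e y'

# seen[$2]++ yields the count before the increment (0 on first sight),
# so !seen[$2]++ is true only the first time a given second field appears.
first_seen=$(printf '%s\n' "$input" | awk '!seen[$2]++ { print $2 }')
printf '%s\n' "$first_seen"
```

Because each value is printed the moment it is first encountered, the output follows the input order.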
> I don't know why, but it was actually slightly slower than your:
>
> awk "BEGIN{ FS=\" \" } { array[$2]++ } END{ for ( i in array ) print
> i}"
>
That's very surprising since it's avoiding the loop and array indexing.
When I ran both commands on a file that was 100000 lines long, I got
these results:
PS1> time gawk '!array[$2]++{print $2}'
real 0m1.67s
user 0m1.25s
sys 0m0.14s
PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'
real 0m1.28s
user 0m1.09s
sys 0m0.15s
Beats me....
Ed.
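A rough way to reproduce the comparison might look like the following. It is a sketch under assumptions: the generated file (100000 lines, 1000 distinct second fields) and the function names `streaming` and `at_end` are invented here and are not the data or setup used for the timings above.

```shell
# Hypothetical test data: 100000 lines whose second field cycles through
# 1000 distinct keys. Remove "$datafile" when you are done with it.
datafile=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print i, "key" (i % 1000) }' > "$datafile"

# Variant 1: print each second field the first time it is seen.
streaming() { awk '!array[$2]++ { print $2 }' "$1"; }

# Variant 2: collect the keys and print them all at END (order unspecified).
at_end() { awk '{ array[$2]++ } END { for (i in array) print i }' "$1"; }

# Sanity check before timing: both variants should report the same number of keys.
n_streaming=$(streaming "$datafile" | sort | uniq | wc -l | tr -d ' ')
n_at_end=$(at_end "$datafile" | sort | uniq | wc -l | tr -d ' ')
echo "$n_streaming $n_at_end"
```

Each variant can then be timed separately by prefixing the call with the shell's `time`, with output sent to /dev/null so terminal speed does not skew the numbers.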
| |
| Loki Harfagr 2005-04-08, 3:56 pm |
| On Fri, 08 Apr 2005 08:50:47 -0500, Ed Morton wrote:
>
>
> Jonny wrote:
>
>
> Yeah, you're right. It should've been:
>
> awk "BEGIN{FS = \" \"}!date[$2]++{print $2}"
>
> I just noticed that you're setting the FS to its default value, a
> single space, so you don't actually need that BEGIN section:
>
> awk "!date[$2]++{print $2}"
>
>
> That's very surprising since it's avoiding the loop and array indexing.
> When I ran both commands on a file that was 100000 lines long, I got
> these results:
>
> PS1> time gawk '!array[$2]++{print $2}'
>
> real 0m1.67s
> user 0m1.25s
> sys 0m0.14s
>
> PS1> time gawk '{ array[$2]++ } END{ for ( i in array ) print i}'
>
> real 0m1.28s
> user 0m1.09s
> sys 0m0.15s
>
> Beats me....
>
> Ed.
Yes, I see the same speed difference on a big file, and I don't
understand it either.
$ time awk '!date[$2]++{print $2}' testfileHUGER
real 0m0.606s
user 0m0.433s
sys 0m0.085s
$ wc testfileHUGER
360000 1536000 15024000 testfileHUGER
Another thing I *don't* understand here is
the use of double quotes with the bang (!)
and/or the $.
In bash, csh, and ksh here I can't get them to work; all I get are
these errors (as I'd have expected):
$ awk "!date[$2]++{print $2}" testfile
bash: !date[$2]++{print: event not found
$ csh
% awk "!date[$2]++{print $2}" testfile
date[: Event not found.
%exit
$ ksh
u@h:w$ awk "!date[$2]++{print $2}" testfile
awk: !date[]++{print }
awk: ^ syntax error
awk: Fatal: sous-expression invalide
Would you explain slowly why it works in your environments? :D)
(Whether I set LC_ALL to C or fr_FR, it's still the same
on mine.)
| |
| Ed Morton 2005-04-08, 3:56 pm |
|
Loki Harfagr wrote:
<snip>
> Another thing I *don't* understand here is
> the usability of doublequotes and the bang!
> and/or the $
"!" is the "not" operator. "$" dereferences an argument. Use of double
quotes instead of single is, I believe, a windows thing. I never use it
and wouldn't expect it to work on any UNIX environment.
Ed.
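To make the quoting difference concrete, here is a sketch for a POSIX-style shell. The positional parameters and sample input are invented for illustration. The `!` is left out of the double-quoted cases on purpose, since interactive bash and csh treat it as history expansion there, which is the "event not found" error quoted above.

```shell
# Give the shell its own $1 and $2 so the expansion is visible.
set -- first second

# Inside double quotes the shell substitutes its own $2 before awk runs.
double=$(printf '%s' "{print $2}")

# Inside single quotes the text reaches awk untouched.
single=$(printf '%s' '{print $2}')

# Single quotes: awk sees $2 as its second field.
result=$(printf 'a x\nb y\nc x\n' | awk '!seen[$2]++ { print $2 }')

# Double quotes can work on Unix if each $ is escaped for the shell.
escaped=$(printf 'a x\n' | awk "{ print \$2 }")

printf '%s\n' "$double" "$single" "$result" "$escaped"
```

With no positional parameters set, the unescaped `$2` expands to nothing, which matches the `!date[]++{print }` that ksh passed to awk in the error above.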
| |
| Loki Harfagr 2005-04-08, 8:56 pm |
| On Fri, 08 Apr 2005 12:59:47 -0500, Ed Morton wrote:
>
>
> Loki Harfagr wrote:
> <snip>
>
> "!" is the "not" operator. "$" dereferences an argument.
Well, you know I know this :-)
> Use of double
> quotes instead of single is, I believe, a windows thing.
All right! I guess you've found the trick!
I feel really sorry for having posted such a dull question
without even thinking about it :D)
> I never use it
Well, I bet I never even tried to, before these posts ...
> and wouldn't expect it to work on any UNIX environment.
I confirm it *does* **not** work.
> Ed.
Thanks for the relief, Ed. I even thought I was going blind;
it might be the case of conjunctivitis I've had these days.
Woah, Windows. It never occurred to me that could be the point;
I should have noticed that "Noworyta" is a win32 app ...
I guess I'm going to have a better weekend now :D)
Cheers indeed.
Have a nice weekend too, Ed.
PS: as a follow-up to the previous posts about speed and
performance, I "awka'd" the scripts to C, and against all odds the same
relative differences appear.
The next step (when/if I find some time not spent writing
stoopeedeeteez on Usenet) would be to write a sharpened C version of it ...
We'll see :-=)
| |
|
|
|
|
|