merge two files in awk
amrita.ray@gmail.com

2006-10-30, 7:01 pm

Hi,
I have two files (file 1 has one column and file 2 four columns). I
have to choose the rows of file 2 where columns 2 and 3 of file 2 match
column 1 of file 1. Does anybody have any idea?
Thanks.

mainak.sen@gmail.com

2006-10-30, 7:01 pm

To add to this with an example,

file 1 :
1_8
1_9
1_10
1_11


file 2 :
1 1_500 1_600 0.000 1.0 0.0 0.0
1 1_500 1_500 0.000 0.0 0.0 1.0
1 1_9 1_100 0.000 0.50000 0.50000 0.00000
1 1_9 1_200 0.000 0.50000 0.50000 0.00000
1 1_9 1_400 0.000 1.0 0.0 0.0
.....
1 1_8 1_500 2.107 0.59766 0.40234 0.00000
1 1_8 1_9 2.107 0.89431 0.10569 0.00000
1 1_8 1_300 2.107 0.0 1.0 0.0


Merge the two files so that the output is
1 1_8 1_9 2.107 0.89431 0.10569 0.00000

i.e. the rows of file 2 where col. 2 and col. 3 each match an
entry in file 1


amrita.ray@gmail.com wrote:
> Hi,
> I have two files (file 1 has one column and file 2 four columns). I
> have to choose the rows of file 2 where columns 2 and 3 of file 2 match
> column 1 of file 1. Does anybody have any idea?
> Thanks.


Vassilis

2006-10-30, 7:01 pm

Please don't top-post. Corrected below.

mainak.sen@gmail.com wrote:
> Hi,
> I have two files (file 1 has one column and file 2 four columns). I
> have to choose the rows of file 2 where columns 2 and 3 of file 2 match
> column 1 of file 1. Does anybody have any idea?
> Thanks.

awk 'NR == FNR { col[$0]++ }
$2 in col && $3 in col' file1 file2
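
For anyone who wants to try this, here is a minimal sketch that runs the one-liner above against the sample rows posted earlier in the thread. The scratch directory and the file names file1 and file2 are assumptions made for illustration, and the elided "....." line of the sample is left out.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# file1: the keys to look up, one per line.
printf '%s\n' 1_8 1_9 1_10 1_11 > file1

# file2: the sample rows from the thread (the elided "....." line is omitted).
cat > file2 <<'EOF'
1 1_500 1_600 0.000 1.0 0.0 0.0
1 1_500 1_500 0.000 0.0 0.0 1.0
1 1_9 1_100 0.000 0.50000 0.50000 0.00000
1 1_9 1_200 0.000 0.50000 0.50000 0.00000
1 1_9 1_400 0.000 1.0 0.0 0.0
1 1_8 1_500 2.107 0.59766 0.40234 0.00000
1 1_8 1_9 2.107 0.89431 0.10569 0.00000
1 1_8 1_300 2.107 0.0 1.0 0.0
EOF

# While the first file is being read (NR == FNR), store each line as a key.
# The second pattern has no action, so awk prints every record for which
# fields 2 and 3 are both stored keys.
out=$(awk 'NR == FNR { col[$0]++ }
$2 in col && $3 in col' file1 file2)

printf '%s\n' "$out"
# prints: 1 1_8 1_9 2.107 0.89431 0.10569 0.00000
```

Note that NR == FNR only identifies the first file when that file is not empty; with an empty file1 the test is also true for every line of file2.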

William James

2006-10-30, 7:01 pm

You already got your answer in comp.unix.shell.

amrita.ray@gmail.com

2006-10-30, 7:01 pm

Yes, the answer:
awk 'NR==FNR {s[$1]} NR!=FNR && ($2 in s) && ($3 in s)' file1 file2
Thanks.


William James wrote:
> You already got your answer in comp.unix.shell.


Ed Morton

2006-10-30, 7:01 pm

amrita.ray@gmail.com wrote:
> Yes, the answer:
> awk 'NR==FNR {s[$1]} NR!=FNR && ($2 in s) && ($3 in s)' file1 file2


That's the wrong answer. Check the others you got.

Ed.


Janis Papanagnou

2006-10-30, 7:01 pm

Ed Morton wrote:
> amrita.ray@gmail.com wrote:
>
>
> That's the wrong answer. Check the others you got.


What's wrong with it? In c.u.s the OP said it works.

Janis
>
> Ed.
>
Ed Morton

2006-10-30, 7:01 pm

Janis Papanagnou wrote:
> Ed Morton wrote:
>
>
>
> What's wrong with it? In c.u.s the OP said it works.


It has 2 tests instead of one so it's less efficient and more
complicated than it has to be. The right answer is:

awk 'NR==FNR {s[$1]; next} ($2 in s) && ($3 in s)' file1 file2

Ed.
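
As a quick check of the two one-liners under discussion, the sketch below runs both on a small input and compares the results. The scratch directory, the file names and the three-row file2 (a subset of the sample posted earlier) are assumptions made for illustration.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 1_8 1_9 1_10 1_11 > file1
printf '%s\n' \
  '1 1_9 1_100 0.000 0.50000 0.50000 0.00000' \
  '1 1_8 1_9 2.107 0.89431 0.10569 0.00000' \
  '1 1_8 1_300 2.107 0.0 1.0 0.0' > file2

# Guarded version: each rule states which file it applies to.
guarded=$(awk 'NR==FNR {s[$1]} NR!=FNR && ($2 in s) && ($3 in s)' file1 file2)

# "next" version: the first rule consumes the file1 records,
# so the second rule is only ever evaluated for file2.
with_next=$(awk 'NR==FNR {s[$1]; next} ($2 in s) && ($3 in s)' file1 file2)

if [ "$guarded" = "$with_next" ]; then
    echo "same output"
else
    echo "outputs differ"
fi
```

On this input both variants select the same row, so the difference between them is one of style and of how many tests are evaluated per record, not of result.
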
Janis Papanagnou

2006-10-30, 7:01 pm

Ed Morton wrote:
> Janis Papanagnou wrote:
>
> It has 2 tests instead of one so it's less efficient and more
> complicated than it has to be.


I wouldn't call that wrong, just different. Efficiency? - Maybe; I
think any difference is of little relevance here (it may even depend
on how much optimization the awk interpreter does).
Never mind.

But personally I think that breaking awk's natural parse sequence
by using 'next' is more "complicated" than guarding the conditions
à la Dijkstra's if-guards.

But I wouldn't call any of the two proposed one liners complicated,
anyway, as I wouldn't call any of the two solutions "wrong".

Janis

> The right answer is:
>
> awk 'NR==FNR {s[$1]; next} ($2 in s) && ($3 in s)' file1 file2
>
> Ed.

Ed Morton

2006-10-30, 7:01 pm

Janis Papanagnou wrote:
> Ed Morton wrote:
>
>
>
> I wouldn't call that wrong, just different.


I would call it wrong because in addition to the above it's not
extensible. Let's say you want to do other things with the file2
records. Would you then do this:

awk '
NR==FNR {s[$1]}
NR!=FNR && ($2 in s) && ($3 in s) { print }
NR!=FNR && theSkyIsGrey { ... }
NR!=FNR && scotlandWinsWorldCup { ... }
NR!=FNR && endOfWorldArrives { ... }
' file1 file2

Rather than this:

awk '
NR==FNR {s[$1]; next}
($2 in s) && ($3 in s) { print }
theSkyIsGrey { ... }
scotlandWinsWorldCup { ... }
endOfWorldArrives { ... }
' file1 file2

Yes, the first version will work, but I'd be surprised if anyone
advocated doing it that way. Also, as a Scot, I suspect that the
second-from-last condition will unfortunately never be true....

Regards,

Ed.
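
To make the "next" layout above concrete, here is a minimal runnable sketch with one extra rule added. The record counter `seen`, the scratch directory and the three-row file2 are hypothetical stand-ins for the placeholder conditions in the post.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 1_8 1_9 1_10 1_11 > file1
printf '%s\n' \
  '1 1_9 1_100 0.000 0.50000 0.50000 0.00000' \
  '1 1_8 1_9 2.107 0.89431 0.10569 0.00000' \
  '1 1_8 1_300 2.107 0.0 1.0 0.0' > file2

out=$(awk '
NR==FNR { s[$1]; next }            # file1: remember the key, skip all later rules
($2 in s) && ($3 in s) { print }   # file2 only: fields 2 and 3 are both keys
{ seen++ }                         # file2 only: hypothetical extra rule
END { print "file2 records: " seen }
' file1 file2)

printf '%s\n' "$out"
# prints:
# 1 1_8 1_9 2.107 0.89431 0.10569 0.00000
# file2 records: 3
```

Because the first rule ends with next, the extra rule needs no NR!=FNR guard and counts only the three file2 records.
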
Vassilis

2006-10-30, 7:01 pm


<OT>
Ed Morton wrote:
> awk '
> NR==FNR {s[$1]; next}
> ($2 in s) && ($3 in s) { print }
> theSkyIsGrey { ... }
> scotlandWinsWorldCup { ... }
> endOfWorldArrives { ... }
> ' file1 file2
>
> Yes, the first version will work, but I'd be surprised if anyone
> advocated doing it that way. Also, as a Scot, I suspect that second from
> last condition will unfortunately never be true....
>
> Regards,
>
> Ed.


Cheer up, mate. Greece won Euro 2004.
Impossible is nothing.
I hear Scotland has some team these days.
</OT>

Janis Papanagnou

2006-10-30, 7:01 pm

Ed Morton wrote:
> Janis Papanagnou wrote:
>
> I would call it wrong because in addition to the above it's not
> extensible. Let's say you want to do other things with the file2
> records. Would you then do this:
>
> awk '
> NR==FNR {s[$1]}
> NR!=FNR && ($2 in s) && ($3 in s) { print }
> NR!=FNR && theSkyIsGrey { ... }
> NR!=FNR && scotlandWinsWorldCup { ... }
> NR!=FNR && endOfWorldArrives { ... }
> ' file1 file2
>
> Rather than this:
>
> awk '
> NR==FNR {s[$1]; next}
> ($2 in s) && ($3 in s) { print }
> theSkyIsGrey { ... }
> scotlandWinsWorldCup { ... }
> endOfWorldArrives { ... }
> ' file1 file2


I would have done exactly the same as you _in this case_, using 'next'.

But extensibility is a multifold (and here an academic?) argument.
If you want to extend your program _in a different way_, say...

NR==FNR {s[$1]}
NR!=FNR && ($2 in s) && ($3 in s) { print }
otherConditionForAllFiles1 { ... }
otherConditionForAllFiles2 { ... }
otherConditionForAllFilesN { ... }
{ ...}

...where the action code in the otherConditionForAllFiles<i> depends on
status data set by any of the first two cases, say s[], 'next' would not
be helpful. (And this extension is just one other example (of many).)

A 'next' breaks native control flow. It's an optimization command, IMO,
as is a continue, break, or goto in other languages. And sometimes it
makes code even more readable/comprehensible/maintainable. Sometimes.
Sometimes not.
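
A minimal sketch of that situation, assuming a hypothetical shared rule that counts the lines of every input file (the `lines` array, the scratch directory and the sample rows are illustrative, not from the thread). It runs the same program with and without next:

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 1_8 1_9 1_10 1_11 > file1
printf '%s\n' \
  '1 1_9 1_100 0.000 0.50000 0.50000 0.00000' \
  '1 1_8 1_9 2.107 0.89431 0.10569 0.00000' \
  '1 1_8 1_300 2.107 0.0 1.0 0.0' > file2

# Guarded rules: the shared rule at the end still sees the file1 records.
guarded=$(awk '
NR==FNR { s[$1] }
NR!=FNR && ($2 in s) && ($3 in s) { print }
{ lines[FILENAME]++ }              # hypothetical rule meant for every input file
END { print "file1=" (lines["file1"]+0), "file2=" (lines["file2"]+0) }
' file1 file2)

# Same program with "next": file1 records never reach the shared rule.
with_next=$(awk '
NR==FNR { s[$1]; next }
($2 in s) && ($3 in s) { print }
{ lines[FILENAME]++ }
END { print "file1=" (lines["file1"]+0), "file2=" (lines["file2"]+0) }
' file1 file2)

printf '%s\n' "guarded:" "$guarded" "with next:" "$with_next"
```

With these inputs the guarded program ends with "file1=4 file2=3", while the next variant ends with "file1=0 file2=3", because next skips the shared rule for every file1 record.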

> Yes, the first version will work, but I'd be surprised if anyone
> advocated doing it that way.


Still advocating it, since the conditions are clearer.

(Though still saying, in a one-liner like these, the difference is of
little relevance.)

> Also, as a Scot, I suspect that second from
> last condition will unfortunately never be true....


There are many ways to reach a goal; in awk as well as in football/soccer.
:-)

Janis

> Regards,
>
> Ed.
