Delete records with two or more identical fields
Jonny

2006-03-28, 3:56 am

Hi,

Please could you tell me how I could delete records which have two or
more identical fields.

For example:

112 544 667
233 654 233
786 786 897
234 546 877
102 548 548

would give:

112 544 667
234 546 877

Thanks for your help.

Regards,
Jonny
Loki Harfagr

2006-03-28, 6:56 pm

On Tue, 28 Mar 2006 08:54:35 +0000, Jonny wrote:

> Hi,
>
> Please could you tell me how I could delete records which have two or
> more identical fields.
>
> For example:
>
> 112 544 667
> 233 654 233
> 786 786 897
> 234 546 877
> 102 548 548
>
> would give:
>
> 112 544 667
> 234 546 877
>
> Thanks for your help.
>
> Regards,
> Jonny


I added a test case you didn't include in your sample file :
$ cat MISCFILES/dupfields.txt
112 544 667
233 654 233
786 786 897
234 546 877
548 666 548
102 548 548

Then, with a lot of "bad white magic":
$ awk 'BEGIN{i=0} {for(j=1;j<=NF;j++){if(a[$j]){delete(a); next};a[$j]=1} ; v[++i]=$0; delete(a)} END{while(i){print v[i--]}}' MISCFILES/dupfields.txt
234 546 877
112 544 667

If you want to sort it, sort it :-)

And, Ed. please, don't shout too loud about the while in the END,
I had a hard w and then some ;-)
William James

2006-03-28, 6:56 pm

Jonny wrote:
> Hi,
>
> Please could you tell me how I could delete records which have two or
> more identical fields.
>
> For example:
>
> 112 544 667
> 233 654 233
> 786 786 897
> 234 546 877
> 102 548 548
>
> would give:
>
> 112 544 667
> 234 546 877
>
> Thanks for your help.
>
> Regards,
> Jonny


{ delete a
  for (i=1; i<=NF; i++)
    a[$i]
}
length(a) == NF

Janis Papanagnou

2006-03-28, 6:56 pm

William James wrote:
> Jonny wrote:
>
>
>
> { delete a
> for (i=1; i<=NF; i++)
> a[$i]
> }
> length(a) == NF
>


This looks nice, but neither did I find length() defined for arrays
in The AWK PL book nor in the GNU manual. And running that results in

fatal: attempt to use array `a' in a scalar context


Janis
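
For awks that reject length() on an array, the element count can be taken with a for-in loop instead. The following is a minimal sketch of that idea, wrapped in a shell function so it can be run on the sample data; the function name keep_distinct and the counter n are illustrative and do not come from the thread.

```shell
# keep_distinct: hypothetical helper name (not from the thread).
# Prints only the records whose fields are all different.
keep_distinct() {
  awk '{
    delete a                           # empty the per-record set
    for (i = 1; i <= NF; i++) a[$i]    # one array element per distinct field
    n = 0
    for (k in a) n++                   # count elements without length(a)
    if (n == NF) print
  }' "$@"
}

sample='112 544 667
233 654 233
786 786 897
234 546 877
102 548 548'

result=$(printf '%s\n' "$sample" | keep_distinct)
printf '%s\n' "$result"
```

The extra loop costs one more pass over the distinct fields of each record, which is the price of not relying on an extension.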
Janis Papanagnou

2006-03-28, 6:56 pm

Jonny wrote:
> Hi,
>
> Please could you tell me how I could delete records which have two or
> more identical fields.
>
> For example:
>
> 112 544 667
> 233 654 233
> 786 786 897
> 234 546 877
> 102 548 548
>
> would give:
>
> 112 544 667
> 234 546 877
>
> Thanks for your help.
>
> Regards,
> Jonny


{ for (i=1; i<=NF; i++) if ($i in a) next; else a[$i]; delete a } 1


Janis
Loki Harfagr

2006-03-28, 6:56 pm

On Tue, 28 Mar 2006 07:58:49 -0800, William James wrote:

> Jonny wrote:
>
> { delete a
> for (i=1; i<=NF; i++)
> a[$i]
> }
> length(a) == NF


Sir, I bow to you and clap my hands to your script !
William James

2006-03-28, 6:56 pm

Janis Papanagnou wrote:
> William James wrote:
>
> This looks nice, but neither did I find length() defined for arrays
> in The AWK PL book nor in the GNU manual. And running that results in
>
> fatal: attempt to use array `a' in a scalar context
>
>
> Janis


You must be one of those poor souls who are limited to
the crude software available on a Un*x platform. I understand
that some Un*x users are so crippled that they cannot
even install new software such as Ruby!

The One True AWK (Brian Kernighan's awk95) is what I use,
since I am not handcuffed by Un*x. It lets you use length()
on arrays.

Janis Papanagnou

2006-03-28, 6:56 pm

William James wrote:
> Janis Papanagnou wrote:
>
>
> You must be one of those poor souls who are limited to


Why so hostile?

> the crude software available on a Un*x platform. I understand
> that some Un*x users are so crippled that they cannot
> even install new software such as Ruby!


This is a very crude argument. And complete nonsense.

> The One True AWK (Brian Kernighan's awk95) is what I use,
> since I am not handcuffed by Un*x. It lets you use length()
> on arrays.


So you suggest using non-portable constructs? I thought so.

Thanks for pointing out what awk version is necessary for that,
as I said, nice construct. It would have been helpful to already
know of that restriction when reading your first posting. Would
have saved me from commenting on it.

And you mean that this version is not available for Unixes?

Janis
Ed Morton

2006-03-28, 6:56 pm

William James wrote:
> Jonny wrote:
>
>
>
> { delete a
> for (i=1; i<=NF; i++)
> a[$i]
> }
> length(a) == NF
>


I haven't thought about this over much, but I think this will produce
the same effect without needing length() to work on arrays:

{ delete a; c=0
  for (i=1; i<=NF; i++)
    c += ++a[$i]
}
c == NF

Regards,

Ed.
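
Why the sum works: ++a[$i] yields 1 the first time a field value is seen in a record, 2 the second time, and so on, so c equals NF only when every field was seen exactly once. A small sketch that prints c next to NF for a few records (the third record, 7 7 7, is an added test case, not from the thread):

```shell
ed_counts=$(printf '%s\n' '112 544 667' '233 654 233' '7 7 7' |
  awk '{ delete a; c = 0
         for (i = 1; i <= NF; i++) c += ++a[$i]   # adds 1, 2, 3, ... per repeat
         print c, NF, (c == NF ? "keep" : "drop") }')
printf '%s\n' "$ed_counts"
# 3 3 keep
# 4 3 drop
# 6 3 drop
```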
Harlan Grove

2006-03-28, 6:56 pm

Janis Papanagnou wrote...
....
>{ for (i=1; i<=NF; i++) if ($i in a) next; else a[$i]; delete a } 1


For records when next gets called, the delete statement isn't run, so
there'd be erroneous entries in a when you begin processing the next
record. Isn't that a bug?

Myself, I don't care whether a is stuffed full after the last record
has been processed, so I'd put the delete statement before the for
loop, and if that caused particular awk implementations trouble, I'd
set a[1] in BEGIN. Then again, arrays involve extra overhead relative
to simple string comparisons, so I'd be tempted to use simplistic brute
force.

{ for (i = 1; i < NF; ++i)
    for (j = i + 1; j <= NF; ++j)
      if ($i == $j) next
  print }
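
The stale-entry problem does not show on the original sample, but a two-line input makes it visible. The sketch below uses made-up data (1 2 2 followed by 2 3 4): the first record is dropped via next, the delete never runs, and the leftover entry for 2 then wrongly drops the second record. The variable names buggy, fixed and brute are illustrative.

```shell
input='1 2 2
2 3 4'

# As posted: delete is skipped whenever next fires.
buggy=$(printf '%s\n' "$input" |
  awk '{ for (i=1; i<=NF; i++) if ($i in a) next; else a[$i]; delete a } 1')

# Delete moved ahead of the loop, as suggested in the post above.
fixed=$(printf '%s\n' "$input" |
  awk '{ delete a; for (i=1; i<=NF; i++) if ($i in a) next; else a[$i] } 1')

# Brute-force pairwise comparison, no array at all.
brute=$(printf '%s\n' "$input" |
  awk '{ for (i = 1; i < NF; ++i) for (j = i + 1; j <= NF; ++j) if ($i == $j) next; print }')

printf 'buggy: [%s]\nfixed: [%s]\nbrute: [%s]\n' "$buggy" "$fixed" "$brute"
```

Only the first variant loses the record 2 3 4; the other two keep it.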

Janis Papanagnou

2006-03-28, 6:56 pm

Harlan Grove wrote:
> Janis Papanagnou wrote...
> ...
>
>
>
> For records when next gets called, the delete statement isn't run, so
> there'd be erroneous entries in a when you begin processing the next
> record. Isn't that a bug?


Yes, I think you are right. That should teach me to use proper
formatting next time which would have made it more apparent. :-}

Thanks for correcting me!

> Myself, I don't care whether a is stuffed full after the last record
> has been processed, so I'd put the delete statement before the for
> loop, and if that caused particular awk implementations trouble, I'd
> set a[1] in BEGIN. Then again, arrays involve extra overhead relative
> to simple string comparisons, so I'd be tempted to use simplistic brute
> force.
>
> { for (i = 1; i < NF; ++i) for (j = i + 1; j <= NF; ++j) if ($i == $j)
> next; print }



Maybe a variant that I had in mind when I posted the above would help...

{ for (i=1; i<=NF; i++) if ((NR,$i) in a) next; else a[NR,$i]; } 1

This also omits the delete in exchange for some memory demands.

Janis
Harlan Grove

2006-03-28, 6:56 pm

Janis Papanagnou wrote...
....
>Maybe a variant that I had in mind when I posted the above would help...
>
> { for (i=1; i<=NF; i++) if ((NR,$i) in a) next; else a[NR,$i]; } 1
>
>This also omits the delete in exchange for some memory demands.


Possibly substantial memory demands for large files with few records
with duplicate fields. Also, as the array a becomes larger,
dereferencing particular entries becomes slower, especially if there
are many duplicate hash values that then require walking linked lists.
HUGE arrays are one of the things awk doesn't handle well. Depends on
circumstances.
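
To make the growth concrete, the sketch below runs the (NR,$i) variant on the five-line sample from the original question and counts what is left in the array at the end. Printing of records is omitted here on purpose; only the array size is of interest. The variable name grown is illustrative.

```shell
sample='112 544 667
233 654 233
786 786 897
234 546 877
102 548 548'

grown=$(printf '%s\n' "$sample" | awk '
  { for (i = 1; i <= NF; i++) if ((NR,$i) in a) next; else a[NR,$i] }
  END { n = 0; for (k in a) n++; print n }')   # elements never deleted
printf '%s\n' "$grown"
# 11
```

Every field examined before a repeat is found stays in the array, so the size grows roughly with the total number of fields read, whereas the variants that delete per record never hold more than NF entries.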

gerryt@

2006-03-28, 6:56 pm


William James wrote:
> Janis Papanagnou wrote:
> You must be one of those poor souls who are limited to
> the crude software available on a Un*x platform. I understand
> that some Un*x users are so crippled that they cannot
> even install new software such as Ruby!
>
> The One True AWK (Brian Kernighan's awk95) is what I use,
> since I am not handcuffed by Un*x. It lets you use length()
> on arrays.


Nice troll or forgery -not-
It IS available... As source. But I guess that's too 'advanced'
for some "un handcuffed" people : <

Janis Papanagnou

2006-03-28, 6:56 pm

Harlan Grove wrote:
> Also, as the array a becomes larger,
> dereferencing particular entries becomes slower, especially if there
> are many duplicate hash values that then require walking linked lists.
> HUGE arrays are one of the things awk doesn't handle well. Depends on
> circumstances.
>


Hmm, depends on the associative array implementation. I can think of
three apparent implementations; a) trees, b) hash arrays with linked
list entries, c) hash array with trees as entries. Those are the most
simple ones. I don't know what the typical awk implementations out
there use, but implementing a fixed table with sequential lists would
not be the best, IMO, since there is absolutely no information about
the input data characteristics. You mean, most awks use variant b)?

Janis
Jonny

2006-03-28, 6:56 pm

Jonny wrote:

> Please could you tell me how I could delete records which have two or
> more identical fields.
>
> For example:
>
> 112 544 667
> 233 654 233
> 786 786 897
> 234 546 877
> 102 548 548
>
> would give:
>
> 112 544 667
> 234 546 877


Thanks to everyone for replying. All of the solutions produced the
desired result with Win32 gawk. Using gawk, the speed was similar for
all of them on some large data files. But using Win32 mawk, Ed's
solution:

{ delete a; c=0; for (i=1; i<=NF; i++) c += ++a[$i]} c == NF

ran in half the time of the other one that worked with mawk (Janis'
solution):

{ for (i=1; i<=NF; i++) if ((NR,$i) in a) next; else a[NR,$i]; } 1

so I decided to go for Ed's.

Using mawk with:

{delete a; for (i=1; i<=NF; i++) a[$i]} length(a) == NF

gave:

mawk: line 1: illegal reference to array a

and with:

BEGIN{i=0} {for(j=1;j<=NF;j++){if(a[$j]){delete(a); next};a[$j]=1} ;
v[++i]=$0; delete(a)} END{while(i){print v[i--]}}

gave:

mawk: line 1: syntax error at or near (
mawk: line 1: syntax error at or near (

I don't know how the above one would perform with mawk if the syntax
error was fixed, but Ed's solution seems hard to beat with my
environment and data.

Thanks again to everyone for sharing your expertise.

Regards,
Jonny

Michael Zawrotny

2006-03-29, 6:56 pm

On Wed, 29 Mar 2006 00:29:03 GMT, Jonny <www.mail@ntlworld.com> wrote:
> Jonny wrote:
>
>
> Thanks to everyone for replying. All of the solutions produced the
> desired result with Win32 gawk. Using gawk, the speed was similar for
> all of them on some large data files. But using Win32 mawk, Ed's
> solution:
>
> { delete a; c=0; for (i=1; i<=NF; i++) c += ++a[$i]} c == NF
>
> ran in half the time of the other one that worked with mawk (Janis'
> solution):


For large data files, as you mention above, the solution below may
run faster. In particular if the number of fields per line is large,
the potential savings is greater because it will bail out as soon as
a repeat is detected.

{
    delete a
    for( i = 1 ; i <= NF; i++ ) {
        a[$i]++;
        if ( a[$i] > 1 ) {
            next
        }
    }
    print
}

On my system, this one ran about three times faster with mawk and five
times faster with gawk. That was with the above data pasted horizontally
16 times (so NF = 48) and concatenated 32K times (NR = 163840).
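
One thing worth noting about this test data, assuming "pasted horizontally" means each sample line repeated 16 times across the record: every record then repeats its first field at position 4, so the early exit fires after 4 of the 48 fields. The sketch below rebuilds the data at a reduced scale (reps=4 is an assumption standing in for the 32K repetitions) and reports the shape and the stopping position.

```shell
base='112 544 667
233 654 233
786 786 897
234 546 877
102 548 548'
reps=4   # assumption: scaled down from the 32K repetitions in the post

data=$(printf '%s\n' "$base" | awk -v reps="$reps" '
  { line = $0
    for (k = 1; k < 16; k++) line = line " " $0    # 16 copies side by side
    rows[NR] = line }
  END { for (r = 1; r <= reps; r++)
          for (n = 1; n <= NR; n++) print rows[n] }')

# Fields per record and number of records
shape=$(printf '%s\n' "$data" | awk 'NR == 1 { nf = NF } END { print nf, NR }')

# Field index at which the early-exit loop stops on the first record
stop=$(printf '%s\n' "$data" |
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if (a[$i]++) { print i; break } }')

printf 'NF NR: %s\nfirst repeat at field: %s\n' "$shape" "$stop"
```

On data with few or no repeated fields the gap between the early-exit and full-scan variants should be smaller, since both then visit every field.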


Mike

--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069
Jonny

2006-03-29, 6:56 pm

Michael Zawrotny wrote:

> On Wed, 29 Mar 2006 00:29:03 GMT, Jonny <www.mail@ntlworld.com> wrote:
>
> For large data files, as you mention above, the solution below may
> run faster. In particular if the number of fields per line is large,
> the potential savings is greater because it will bail out as soon as
> a repeat is detected.
>
> {
> delete a
> for( i = 1 ; i <= NF; i++ ) {
> a[$i]++;
> if ( a[$i] > 1 ) {
> next
> }
> }
> print
> }
>
> On my system, this one ran about three times faster with mawk and five
> times faster with gawk. That was with the above data pasted horizontally
> 16 times (so NF = 48) and concatenated 32K times (NR = 163840).


Thanks Mike. Your solution is indeed much faster.

Regards,
Jonny

Harlan Grove

2006-03-29, 6:56 pm

Michael Zawrotny wrote...
....
>For large data files, as you mention above, the solution below may
>run faster. In particular if the number of fields per line is large,
>the potential savings is greater because it will bail out as soon as
>a repeat is detected.
>
>{
> delete a
> for( i = 1 ; i <= NF; i++ ) {
> a[$i]++;
> if ( a[$i] > 1 ) {
> next
> }
> }
> print
>}

....

This should run faster because you're bailing out with the next
statement, but you could shorten this a bit.

{
    delete a
    for( i = 1 ; i <= NF; i++ ) if (a[$i]++) next
    print
}
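
The shortened test relies on post-increment returning the value before the increment: 0 (false) the first time a field value is seen, 1 or more (true) on any repeat. A small sketch on a made-up record, x y x, that prints each field with the value the condition would see:

```shell
trace=$(echo 'x y x' |
  awk '{ for (i = 1; i <= NF; i++)
           printf "%s%s:%d", (i > 1 ? " " : ""), $i, a[$i]++   # old value is printed
         print "" }')
printf '%s\n' "$trace"
# x:0 y:0 x:1
```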

Jonny

2006-03-29, 6:56 pm

Harlan Grove wrote:

> Michael Zawrotny wrote...
> ....
> ....
>
> This should run faster because you're bailing out with the next
> statement, but you could shorten this a bit.
>
> {
> delete a
> for( i = 1 ; i <= NF; i++ ) if (a[$i]++) next
> print
> }


Thanks Harlan. Yours is even faster still.

Regards,
Jonny


Grant

2006-03-29, 6:56 pm

On 29 Mar 2006 13:35:32 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

>This should run faster because you're bailing out with the next
>statement, but you could shorten this a bit.
>
>{
> delete a
> for( i = 1 ; i <= NF; i++ ) if (a[$i]++) next
> print
>}


I was waiting for some golfer to come along and fix that ;)

Grant.
--
Memory fault -- brain fried
Michael Zawrotny

2006-03-30, 6:56 pm

Harlan Grove wrote:
> Michael Zawrotny wrote...
>
> ...
>
> This should run faster because you're bailing out with the next
> statement, but you could shorten this a bit.
>
> {
> delete a
> for( i = 1 ; i <= NF; i++ ) if (a[$i]++) next
> print
> }


Sure. I thought of that, but I was going for clarity. I don't really
like using the return value of variable autoincrement or decrement in
conditional tests. It always takes me a bit of extra thought to parse
it. Yours runs 10-20% faster than mine on my system and I would only
make that trade if the slightly reduced speed of mine was a deal
breaker.

That's personal taste, obviously, so YMMV.


Mike


--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069
Harlan Grove

2006-03-30, 6:56 pm

Michael Zawrotny wrote...
>Harlan Grove wrote:
....
....
>
>Sure. I thought of that, but I was going for clarity. I don't really
>like using the return value of variable autoincrement or decrement in
>conditional tests. . . .

....
>That's personal taste, obviously, so YMMV.


I didn't make the changes I made in order to speed up your code. I did
it to improve readability for me. Myself, I find unnecessary braces
reduce readability. And then there's the ghastly unnecessary semicolon
statement terminator in your code (cringe!).

One person's clarity is another's obfuscated code. As you say, YMMV.

Michael Zawrotny

2006-03-30, 6:56 pm

Harlan Grove wrote:
> Michael Zawrotny wrote...
> ...
> ...
> ...
>
> I didn't make the changes I made in order to speed up your code. I did
> it to improve readability for me. Myself, I find unnecessary braces
> reduce readability.


Fair enough. I always put them in to start with so I don't have to
add them later when I add another line to the conditional or loop.
Once again, personal taste.

> And then there's the ghastly unnecessary semicolon
> statement terminator in your code (cringe!).


You got me on that one. Guilty as charged. My finger slipped.

> One person's clarity is another's obfuscated code. As you say, YMMV.


We're in perfect agreement there.


Mike

--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069
Copyright 2008 codecomments.com