Removing duplicate keys

Richard Hassler

2005-02-10, 8:55 pm

(sorry if this is a double post)

File A (osuemail) is a data file with duplicate keys.

File B (dup.out) is a list file of only the duplicate keys.

My awk script is supposed to remove the duplicate keys from file A using
file B to tell it which ones to remove.

awk -F: '
BEGIN {
    getline dup_ssn < "dup.out"
}

{
    if (dup_ssn != $3) print "1", dup_ssn, $3, $8
    else if (dup_ssn == $3 && (substr($8,1,1) ~ /C/ || substr($8,1,3) == 500) ) {
        #print "3"
        getline dup_ssn < "dup.out"
    }
    else print "2", dup_ssn, $3, $8
}' /tmp/osuemail | wc -l

All files have been sorted in key sequence.

There are 276,990 records in file A.
There are 81 duplicates in file B.

After numerous adjustments to the awk script I still get a count of
276,970 in the new file C (not shown in the script).

Even this is not right because in analyzing the data in files A, B and C
I find that there are 43 duplicates not removed and 38 duplicates removed.

It looks like a sequencing problem to me but I supposedly took that out
of the equation by sorting all the files.

It found the first 8 duplicates.
Missed the next 2 duplicates.
Found the next duplicate.
Missed the next duplicate.
Found the next duplicate.
Missed the next duplicate.
Found the next duplicate.
Missed the next 3 duplicates.
Found the next 2 duplicates.
Missed the next duplicate.
Found the next 2 duplicates.
Missed the next 3 duplicates.
Found the next duplicate.
Missed the next duplicate.
Found the next duplicate.
Missed the next 3 duplicates.
Found the next duplicate.
Missed the next duplicate.
Found the next duplicate.
Missed the next 5 duplicates.
Found the next duplicate.
Missed the next 2 duplicates.
etc.

Is it a timing problem or what?

Ed Morton

2005-02-10, 8:55 pm



Richard Hassler wrote:

> (sorry if this is a double post)
>
> File A (osuemail) is a data file with duplicate keys.
>
> File B (dup.out) is a list file of only the duplicate keys.
>
> My awk script is supposed to remove the duplicate keys from file A using
> file B to tell it which ones to remove.
>
> awk -F: '
> BEGIN {
> getline dup_ssn < "dup.out"
> }

<snip>

Using "getline" is almost always the wrong solution.

It's hard to tell what your script is really trying to do, but if you
just want to discard all the records that have one of the dup keys in
field 3, you just need something like this (untested):

gawk -F: 'NR==FNR{dups[$0]="";next}!($3 in dups)' dup.out /tmp/osuemail

Posting some sample input and expected output would help.
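For anyone who wants to try the idea on a small scale first, here is a minimal sketch. The keys and records are made up for illustration (the real layout of osuemail was never posted); the only assumptions are that the data file is colon-separated with the key in field 3, as in the original script, and that the key list holds one key per line.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 222 444 > "$tmp/dup.out"
printf '%s\n' \
  'a:b:111:x' \
  'a:b:222:y' \
  'a:b:222:z' \
  'a:b:333:x' \
  'a:b:444:y' > "$tmp/osuemail"

# First file: remember every key.
# Second file: print only records whose field 3 is not a remembered key.
kept=$(awk -F: 'NR==FNR { dups[$0]; next } !($3 in dups)' \
  "$tmp/dup.out" "$tmp/osuemail")

printf '%s\n' "$kept"
# prints:
# a:b:111:x
# a:b:333:x

rm -rf "$tmp"
```

Note that this drops every record carrying a listed key, not just one copy of it.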

Regards,

Ed.

Jürgen Kahrs

2005-02-10, 8:55 pm

Richard Hassler wrote:

> File B (dup.out) is a list file of only the duplicate keys.
>
> My awk script is supposed to remove the duplicate keys from file A using
> file B to tell it which ones to remove.


Then you have to read *all* duplicate keys
from file B before starting to read file A.

> awk -F: '
> BEGIN {
> getline dup_ssn < "dup.out"


Here you read only the first duplicate key.

> }
>
> {
> if (dup_ssn != $3) print "1", dup_ssn, $3, $8
> else if (dup_ssn == $3 && (substr($8,1,1) ~ /C/ || substr($8,1,3) ==
> 500) ) {
> #print "3"
> getline dup_ssn < "dup.out"
> }


Why do you read the rest of file B only conditionally?

Ed Morton

2005-02-11, 8:55 pm



Ed Morton wrote:

<snip>
> Using "getline" is almost always the wrong solution.
>
> It's hard to tell what your script is really trying to do, but if you
> just want to discard all the records that have one of the dup keys in
> field 3, you just need something like this (untested):
>
> gawk -F: 'NR==FNR{dups[$0]="";next}!($3 in dups)' dup.out /tmp/osuemail
>
> Posting some sample input and expected output would help.
>
> Regards,
>
> Ed


Richard - I got your email but generally won't respond to email about
technical questions related to a NG thread. Please post questions here
so others can also help and/or learn and/or correct responses.

Regards,

Ed.

Richard Hassler

2005-02-14, 3:55 pm

Your comment about 'getline' was interesting so I reread the O'Reilly
"sed & awk" book. I didn't find anything negative so I am curious as to
why you don't think it is the right solution. Of course something must
be wrong with it since my script isn't working right.

Even though my data file has duplicate keys the rest of the data in the
records is somewhat different and I want to be able to delete the right
duplicate.

Unfortunately I can't figure out your suggestion so I'm going to have to
do more reading and trying. I had tried the 'next' command but in
conjunction with the 'while' construct and never got it to do what I
thought it said it would do.

Then [$0] I'm assuming is an array.

And I'm completely baffled by the last two strings at the end although
it would appear to be accessing two files in this way.



William James

2005-02-14, 8:55 pm

Richard Hassler wrote:
> Your comment about 'getline' was interesting so I reread the O'Reilly
> "sed & awk" book. I didn't find anything negative so I am curious as to
> why you don't think it is the right solution. Of course something must
> be wrong with it since my script isn't working right.


Ed is certainly right. Awk is able, ready, willing, and
eager to read lines for you automatically. It is unwise
to read lines manually unless there is a good reason.

> Even though my data file has duplicate keys the rest of the data in the
> records is somewhat different and I want to be able to delete the right
> duplicate.


You need to show us what the contents of your files
look like. And show the output you want.

> Unfortunately I can't figure out your suggestion so I'm going to have to
> do more reading and trying. I had tried the 'next' command but in
> conjunction with the 'while' construct and never got it to do what I
> thought it said it would do.
>
> Then [$0] I'm assuming is an array.
>
> And I'm completely baffled by the last two strings at the end although
> it would appear to be accessing two files in this way.


You read a book on Awk and you're baffled by filenames
on the command line that invokes Awk?


awk -F: '
# Store the lines in the first file as keys
# in the associative array "dups".
# (When NR==FNR, the first file is being read.)
NR==FNR { dups[$0]=""; next }
# A line from the second file has been read.
# If field 3 was not in first file, print
# the line.
!($3 in dups)
' dup.out /tmp/osuemail


"dup.out" will be read first, followed by
"/tmp/osuemail".
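Richard's original script only dropped a record when field 8 started with "C" or began with 500, which suggests that is how he identifies the "right" duplicate. A hedged sketch of the same two-file approach with that extra test follows. The sample records are invented, and the rule itself is only inferred from the posted script, so treat it as a starting point rather than the answer.

```shell
# Hypothetical sample data: key in field 3, status-like value in field 8.
tmp=$(mktemp -d)
printf '%s\n' 222 444 > "$tmp/dup.out"
printf '%s\n' \
  'f1:f2:111:f4:f5:f6:f7:A100' \
  'f1:f2:222:f4:f5:f6:f7:A200' \
  'f1:f2:222:f4:f5:f6:f7:C200' \
  'f1:f2:444:f4:f5:f6:f7:500X' \
  'f1:f2:444:f4:f5:f6:f7:B300' > "$tmp/osuemail"

kept=$(awk -F: '
# First file: load all duplicate keys.
NR==FNR { dups[$0]; next }
# Second file: skip a record only if its key is listed AND
# field 8 starts with C or its first three characters are 500.
($3 in dups) && ($8 ~ /^C/ || substr($8, 1, 3) == "500") { next }
# Everything else is kept.
{ print }
' "$tmp/dup.out" "$tmp/osuemail")

printf '%s\n' "$kept"
# prints:
# f1:f2:111:f4:f5:f6:f7:A100
# f1:f2:222:f4:f5:f6:f7:A200
# f1:f2:444:f4:f5:f6:f7:B300

rm -rf "$tmp"
```

Because all keys are loaded before the data file is read, the result no longer depends on the two files being sorted the same way.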

Richard Hassler

2005-02-14, 8:55 pm

I'm probably wrong but I don't remember anything in the book that showed
_multiple_ input files and that was why I used getline. At least with
your comment I see what is meant by getline being a bad solution and I
agree.

I think I detected a slight 'dig' in your last sentence. Perhaps you
would like to enlighten me. With the reputation that O'Reilly books
have I assumed they would cover everything including processing multiple
files and in fact they did as regards what OS would handle how many open
files at one time.

While waiting for your response to this I will do some more reading.


Ed Morton

2005-02-14, 8:55 pm



Richard Hassler wrote:
> Your comment about 'getline' was interesting so I reread the O'Reilly
> "sed & awk" book. I didn't find anything negative so I am curious as to
> why you don't think it is the right solution. Of course something must
> be wrong with it since my script isn't working right.


Here's what awk does by default:

WHILE read line
DO
    set builtin variables
    process user's code
DONE

Incredibly simple. Here's what awk does when you add a getline:

WHILE read line
DO
    set builtin variables
    START processing user's code
        process some of user's code
        WHILE getline line
        DO
            set [some] builtin variables
            process some more of user's code
        DONE
        process rest of user's code
    END processing user's code
DONE

Now not so simple. Add another getline or 2 and it's getting downright
complicated. So - let's say you want to print (or otherwise process)
every line. Where do you do that? The answer is clear and simple without
getline. With getline you have to think about it for a bit before
realising you need to duplicate your print command in 2 places. Let's
say you come back in a year and have to modify your print command or add
new functionality that you want hit for every record. Will you remember
to modify/add it in both places? In other words, it creates a
maintenance headache.

Now, let's say you start off using "getline" with no arguments. That'll
cause the builtin variables NR, FNR, NF, and $0 to be populated. If you
change that later to populate a variable it will NOT modify NF or $0. If
you change it later to read from a file, it won't modify NR or FNR
either, so if you wrote this code:

awk 'BEGIN{ while((getline) > 0){tmp=$0; print NR, NF, tmp} }' file

you'd get a print of:

<record number><number of fields in this record><this record>

If you change it to this apparently equivalent code:

awk 'BEGIN{ while((getline tmp) > 0){print NR, NF, tmp} }' file

you'd get a print of:

<record number><0><this record>

If you change it to this apparently equivalent code:

awk 'BEGIN{ while((getline tmp < "file") > 0){print NR, NF, tmp} }'

you'd get a print of:

<0><0><this record>

because getline from a file doesn't set NR either.
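The three variants can be compared side by side with a throwaway two-line file. This is only a sketch: the file contents and the shell variable names are made up, and each loop tests the getline result with "> 0" so that a read error cannot turn into an endless loop.

```shell
# Hypothetical two-record input file.
tmp=$(mktemp -d)
printf '%s\n' 'a b c' 'd e' > "$tmp/file"

# Plain getline: sets $0, NF, NR and FNR.
plain=$(awk 'BEGIN { while ((getline) > 0) print NR, NF, $0 }' "$tmp/file")

# getline var: sets var, NR and FNR; $0 and NF stay untouched.
intovar=$(awk 'BEGIN { while ((getline tmp) > 0) print NR, NF, tmp }' "$tmp/file")

# getline var < file: sets var only.
fromfile=$(awk -v f="$tmp/file" \
  'BEGIN { while ((getline tmp < f) > 0) print NR, NF, tmp }')

printf '%s\n' "$plain" "$intovar" "$fromfile"
# prints:
# 1 3 a b c
# 2 2 d e
# 1 0 a b c
# 2 0 d e
# 0 0 a b c
# 0 0 d e

rm -rf "$tmp"
```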

There are other non-obvious (but documented and useful in their context)
behaviors related to getline, all of which lead to the statement in the
gawk user's guide
(http://www.gnu.org/software/gawk/ma...wk.html#Getline) that "The
getline command is used in several different ways and should not be used
by beginners ... come back and study the getline command after you have
reviewed the rest of this Web page and have a good knowledge of how awk
works.".

There are rare occasions when getline is the right solution, but you
have to really consider your options before deciding getline is the
right one.

Unfortunately, for most of us used to procedural programming, getline
lets us just crank out code the way we're used to so until we make the
necessary paradigm shift we'll be missing a lot of simpler, more elegant
and idiomatic solutions.

> Even though my data file has duplicate keys the rest of the data in the
> records is somewhat different and I want to be able to delete the right
> duplicate.
>
> Unfortunately I can't figure out your suggestion so I'm going to have to
> do more reading and trying. I had tried the 'next' command but in
> conjunction with the 'while' construct and never got it to do what I
> thought it said it would do.


Just show us some examples or ask specific questions and we'll help
point you in the right direction.

> Then [$0] I'm assuming is an array.


This is probably what you mean, but just to be sure: [$0] is an array
index, "dups" in my example is the array.

> And I'm completely baffled by the last two strings at the end although
> it would appear to be accessing two files in this way.


Yes, it's accessing 2 files and I'm using the difference between NR and
FNR builtin variables to tell which one I'm accessing. There's also a
FILENAME variable and other ways you can do that, but NR==FNR is the
typical one (and, by the way, FILENAME is another variable that has some
quirks related to using getline).
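One case where the choice matters is an empty first file: NR==FNR is then also true for every record of the second file, so the whole data file gets treated as keys and nothing is printed. A small sketch of the difference, using made-up records and comparing FILENAME against ARGV[1] as the alternative test:

```shell
# Hypothetical data: an EMPTY key list and two data records.
tmp=$(mktemp -d)
: > "$tmp/dup.out"
printf '%s\n' 'a:b:111' 'a:b:222' > "$tmp/osuemail"

# With an empty first file, NR==FNR holds for the second file too,
# so both records are stored as keys and nothing is printed.
by_nr=$(awk -F: 'NR==FNR { dups[$0]; next } !($3 in dups)' \
  "$tmp/dup.out" "$tmp/osuemail")

# Comparing FILENAME with the first operand does not have that problem.
by_name=$(awk -F: 'FILENAME==ARGV[1] { dups[$0]; next } !($3 in dups)' \
  "$tmp/dup.out" "$tmp/osuemail")

printf 'NR==FNR: [%s]\n' "$by_nr"
printf 'FILENAME:\n%s\n' "$by_name"
# prints:
# NR==FNR: []
# FILENAME:
# a:b:111
# a:b:222

rm -rf "$tmp"
```

For a key list that is known to be non-empty, as in this thread, NR==FNR is fine.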

Ed.

Jürgen Kahrs

2005-02-14, 8:55 pm

Richard Hassler wrote:

> I'm probably wrong but I don't remember anything in the book that showed
> _multiple_ input files and that was why I used getline.


The gawk man page says:

gawk executes the code in the BEGIN block(s) (if any),
and then proceeds to read each file named in the ARGV array.
If there are no files named on the command line, gawk reads
the standard input.

So, it is clear that more than one file will be processed.
But you were right in pointing out that it is hard to find
an explicit mention of the fact that you can pass many file
names on the command line.

Apart from this, it seems to be a widespread misconception
that AWK in general cannot process more than one file.
Last week someone told me that he had ported some AWK
scripts to Perl because he thought AWK cannot process
more than one file at a time. This is not a joke.

William James

2005-02-14, 8:55 pm

Richard Hassler wrote:
> With the reputation that O'Reilly books
> have I assumed they would cover everything including processing multiple
> files and in fact they did as regards what OS would handle how many open
> files at one time.


From the Mawk manual, starting at line 11:


-----
SYNOPSIS
    mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
    mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file ...]
-----

"[file ...]" indicates that more than one file can be
given on the command line.

Skipping to line 890:

-----
13. Program execution
This section describes the order of program execution.
First ARGC is set to the total number of command line arguments
passed to the execution phase of the program. ARGV[0]
is set the name of the AWK interpreter and ARGV[1] ...
ARGV[ARGC-1] holds the remaining command line arguments
exclusive of options and program source.
....
If ARGC equals 1, the input stream is set to stdin,
else the command line arguments ARGV[1] ... ARGV[ARGC-1]
are examined for a file argument.

The command line arguments divide into three sets: file
arguments, assignment arguments and empty strings "". An
assignment has the form var=string. When an ARGV[i] is
examined as a possible file argument, if it is empty it is
skipped; if it is an assignment argument, the assignment to
var takes place and i skips to the next argument; else
ARGV[i] is opened for input.
-----
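The "assignment arguments" described in that excerpt give yet another way to tell the input files apart: a var=value operand placed before a file name is assigned just before that file is opened. A minimal sketch, where the variable name "pass" and the sample records are made up for illustration:

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' 222 > "$tmp/dup.out"
printf '%s\n' 'a:b:111' 'a:b:222' > "$tmp/osuemail"

# Operands are examined left to right: pass=1 is assigned before
# dup.out is opened, pass=2 before osuemail is opened.
kept=$(awk -F: 'pass==1 { dups[$0]; next } !($3 in dups)' \
  pass=1 "$tmp/dup.out" pass=2 "$tmp/osuemail")

printf '%s\n' "$kept"
# prints:
# a:b:111

rm -rf "$tmp"
```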

William James

2005-02-14, 8:55 pm


Richard Hassler wrote:
> With the reputation that O'Reilly books
> have I assumed they would cover everything including processing multiple
> files and in fact they did as regards what OS would handle how many open
> files at one time.


For most people, reading a book straight through is not
the best way to learn a programming language. Start writing
small programs when you've read enough to get you started.
The details of the language will be more deeply planted in
your mind when you use it than when you read about it.
And things that are puzzling in the book can be made clear
by experiment.

William James

2005-02-14, 8:55 pm

I should have included this:

-----
When end of file occurs on the input stream, the remaining
command line arguments are examined for a file argument, and
if there is one it is opened, else the END pattern is
considered matched and all END actions are executed.
-----

Chris F.A. Johnson

2005-02-15, 3:56 am

On Mon, 14 Feb 2005 at 22:55 GMT, William James wrote:
>
> For most people, reading a book straight through is not
> the best way to learn a programming language.


I always recommend skimming (not scanning, which is a close
scrutiny) a book cover to cover to get an overview of the
capabilities of the language. Don't expect to understand
everything, just get the gist of what can be done. Once you have
done that, you can look up the how when you need it.

> Start writing small programs when you've read enough to get you
> started. The details of the language will be more deeply planted in
> your mind when you use it than when you read about it. And things
> that are puzzling in the book can be made clear by experiment.


That's really the only way to learn a language. Try to write plain
code, and don't use clever constructions that incorporate multiple
tasks until you are comfortable with the language.

--
Chris F.A. Johnson http://cfaj.freeshell.org/shell
===================================================================
My code (if any) in this post is copyright 2005, Chris F.A. Johnson
and may be copied under the terms of the GNU General Public License