Author Removing duplicates from within sections of a file
Jonny

2005-02-13, 8:55 pm

Hi,

I have a file which contains:

SECTION1
A
B
C
A
A
D
SECTION2
A
B
C
B
D
SECTION3
E
E
C
A
E

I would like to remove duplicates from within each section of the file.

So the above example would become:

SECTION1
A
B
C
D
SECTION2
A
B
C
D
SECTION3
E
C
A

The values within each section can be matched with a single regular
expression.

Please can you help me with this.

Regards,
Jonny

Janis Papanagnou

2005-02-13, 8:55 pm

Jonny wrote:
> Hi,
>
> I have a file which contains:
>
> SECTION1
> A
> B
> C
> A
> A
> D
> SECTION2
> A
> B
> C
> B
> D
> SECTION3
> E
> E
> C
> A
> E
>
> I would like to remove duplicates from within each section of the file.
>
> So the above example would become:
>
> SECTION1
> A
> B
> C
> D
> SECTION2
> A
> B
> C
> D
> SECTION3
> E
> C
> A
>
> The values within each section can be matched with a single regular
> expression.
>
> Please can you help me with this.


It's similar to the solution Ed already posted. But you need to
memorize the contents of data for each section, like in

if ( (sect,$0) in mem) next
else { mem[sect,$0] = 1; print ... }

You may also design it to save some memory space, just memorizing
the data for one section, and clear and reuse an array with every
new section.

Janis
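
The two variants Janis describes could be sketched roughly as follows. This is only a sketch: it assumes that section headers start with SECTION at the beginning of a line, as in Jonny's sample, and the names dedupe_keyed, dedupe_reset, sect and mem are made up for illustration. It uses plain awk rather than gawk, and split("", mem) as the portable way to empty an array.

```sh
# Variant 1: one array for the whole file, keyed by (section, line).
dedupe_keyed() {
    awk '
    /^SECTION/ { sect = $0; print; next }   # remember the current section name
    (sect, $0) in mem { next }              # already seen in this section: skip
    { mem[sect, $0] = 1; print }            # first occurrence: remember and print
    ' "$@"
}

# Variant 2: remember only the current section, emptying the array at each header.
dedupe_reset() {
    awk '
    /^SECTION/ { split("", mem); print; next }  # portable way to empty an array
    !($0 in mem) { mem[$0] = 1; print }         # first occurrence: remember and print
    ' "$@"
}

# Small demonstration on inline data (both variants read stdin or file arguments).
printf '%s\n' SECTION1 A B A SECTION2 A B B | dedupe_keyed
```

Variant 1 keeps every line of the file in memory and treats two sections with the same header text as one section; variant 2 only ever holds the values of the current section.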
Ed Morton

2005-02-13, 8:55 pm



Jonny wrote:
> Hi,
>
> I have a file which contains:
>
> SECTION1
> A
> B
> C
> A
> A
> D
> SECTION2
> A
> B
> C
> B
> D
> SECTION3
> E
> E
> C
> A
> E
>
> I would like to remove duplicates from within each section of the file.
>
> So the above example would become:
>
> SECTION1
> A
> B
> C
> D
> SECTION2
> A
> B
> C
> D
> SECTION3
> E
> C
> A
>
> The values within each section can be matched with a single regular
> expression.
>
> Please can you help me with this.


Try something like this:

gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file

Ed.
Jürgen Kahrs

2005-02-13, 8:55 pm

Hello,

this one works (tested):

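/^SECTION/ {
    print $1    # print the section header
    delete s    # forget the values seen in the previous section
    sec=$1      # remember the section name (not used further below)
    next
}

{
    if ($1 in s)    # value already seen in this section: skip it
        next
    print $1
    s[$1] ++        # mark the value as seen
}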
Jonny

2005-02-13, 8:55 pm

Thanks for your replies.

They all work on the example I gave. I will need to test them on some
larger files later to see if there are any performance differences.

Your help is appreciated.

Regards,
Jonny
Buck Turgidson

2005-02-14, 3:55 pm

> gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file


I am new to awk, and am trying to learn it. This code looks really
elegant, but I am having trouble understanding it. Would you be able to
tell me in pseudocode what it is doing?

I do know that it is testing to see if the current line contains "SECTION",
and if so it initializes an array called "a". I assume "a" contains the
words that you're trying to uniq. But after that I get a little lost. And
I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
say "($0 in a)".

I'd be grateful if you could straighten me out.


Jürgen Kahrs

2005-02-14, 3:55 pm

Buck Turgidson wrote:
> I am new to awk, and am trying to learn it. This code looks really
> elegant, but I am having trouble understanding it. Would you be able to
> tell me in pseudocode what it is doing?


Elegant is an attribute that I would only use
when a solution is also easy to understand.

> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a". I assume "a" contains the
> words that you're trying to uniq. But after that I get a little lost. And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"


The gawk man page explains the "in" operator,
but it is hard to find the word "in":

The special operator in may be used in an if or while statement to see
if an array has an index consisting of a particular value.
    if (val in array)
        print array[val]
If the array has multiple subscripts, use (i, j) in array.
The in construct may also be used in a for loop to iterate over all the
elements of an array.
An element may be deleted from an array using the delete statement.
The delete statement may also be used to delete the entire contents of
an array, just by specifying the array name without a subscript.
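
The points in that man page excerpt can be tried out with a small BEGIN-only program. This is just an illustrative sketch; the function name in_demo and the array names seen and grid are invented for the example.

```sh
in_demo() {
    awk 'BEGIN {
        seen["A"] = 1

        # "in" tests for an index without creating it.
        if ("A" in seen)    print "A is an index"
        if (!("B" in seen)) print "B is not an index"

        # Multiple subscripts: parenthesize the subscript list.
        grid[1, 2] = "x"
        if ((1, 2) in grid) print "(1,2) is an index"

        # "in" inside a for loop iterates over the indices.
        n = 0
        for (k in seen) n++
        print "elements in seen:", n

        # delete removes a single element.
        delete seen["A"]
        if (!("A" in seen)) print "A deleted"
    }'
}

in_demo
```

Note that the count stays at 1 after the "B" test, because testing with "in" did not add "B" to the array.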
Ed Morton

2005-02-14, 3:55 pm



Buck Turgidson wrote:
>
>
>
> I am new to awk, and am trying to learn it. This code looks really
> elegant, but I am having trouble understanding it. Would you be able to
> tell me in pseudocode what it is doing?
>
> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a". I assume "a" contains the
> words that you're trying to uniq. But after that I get a little lost. And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"
>
> I'd be grateful if you could straighten me out.


Some white-space and a couple of comments would probably help:

gawk '
/SECTION/ { delete a }   # if the current line contains the word
                         # "SECTION", then delete (i.e. re-init)
                         # array "a" if it exists.

!($0 in a) {             # IF the string representing the current
                         # input record is NOT already an
                         # index for array "a" THEN
    print                # print the current record
    a[$0] = ""           # and add it as an index to a, just
                         # so we can use the "in" operator later
                         # to test for a record already having
                         # been read.
}
' file

You can read up on the "in" operator at
http://www.gnu.org/software/gawk/ma...ce-to-Elements.
It is absolutely crucial to know about the subtleties of "in" (e.g. it
tests for but doesn't create an array index, unlike a["str"]=="") to do
awk programming.

The PDF file of the whole gawk user guide is at
http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth
printing out and reading.

Ed.
Jürgen Kahrs

2005-02-14, 3:55 pm

Ed Morton wrote:

> The PDF file of the whole gawk user guide is at
> http://www.gnu.org/software/gawk/manual/gawk.pdf and is well worth
> printing out and reading.


Well, such large manuals are sometimes
easier to handle in this format:

http://www.oreilly.com/catalog/awkprog3/index.html
Buck Turgidson

2005-02-14, 3:55 pm

Excellent. Very helpful - thanks.


Ian Stirling

2005-02-14, 3:55 pm

Buck Turgidson <jc_va@hotmail.com> wrote:
>
>
> I am new to awk, and am trying to learn it. This code looks really
> elegant, but I am having trouble understanding it. Would you be able to
> tell me in pseudocode what it is doing?
>
> I do know that it is testing to see if the current line contains "SECTION",
> and if so it initializes an array called "a". I assume "a" contains the
> words that you're trying to uniq. But after that I get a little lost. And
> I don't see the keyword "in" in my gawk man page (SuSE 9.1 Linux) where you
> say "($0 in a)"


Look under Arrays.

$0 in a
is true if $0 is an index (key) of array a.
So the script does this:
For every line containing SECTION, it deletes the stored array.
For every line that is not yet an index of the array, it prints that line,
then adds it to the array.

So, given
SECTION
a
b
c
c
SECTION
c
d
e
a

It will print
SECTION
a
b
c
SECTION
c
d
e
a
Ed Morton

2005-02-14, 3:55 pm



Jürgen Kahrs wrote:
> Ed Morton wrote:
>
>
>
> Well, such large manuals are sometimes
> easier to handle in this format:
>
> http://www.oreilly.com/catalog/awkprog3/index.html


Cute. The book's 3 years out of date though wrt the current version of
gawk and the on-line document.

Ed
Aharon Robbins

2005-02-14, 3:55 pm

In article <fN2dnfL4Sso2Xo3fRVn-jQ@comcast.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>Jürgen Kahrs wrote:
>
>Cute. The book's 3 years out of date though wrt the current version of
>gawk and the on-line document.
>
> Ed


It's not that out-of-date. Very very little has been added to the language
since that book was published. It's worth getting (a) because it's in
a nicer format, and (b) because it puts a few $$ in my pocket.

Thanks,

Arnold
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 206 350 8765
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
Ed Morton

2005-02-14, 3:55 pm



Aharon Robbins wrote:
> In article <fN2dnfL4Sso2Xo3fRVn-jQ@comcast.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>
> It's not that out-of-date.


The version Jürgen referred to is dated May 2001, whereas the on-line
one is June 2004, which is just over 3 years. Is there a newer release of
the book?

> Very very little has been added to the language
> since that book was published. It's worth getting (a) because it's in
> a nicer format, and (b) because it puts a few $$ in my pocket.


Do you have a small document on-line somewhere that just describes the
updates since the book was published? I'd absolutely recommend people
buy the book if that document of deltas was also available, as I
appreciate all the work that you've put into the tool and its
documentation.

Ed.
Buck Turgidson

2005-02-14, 3:55 pm


"Ed Morton" <morton@lsupcaemnt.com> wrote in message
news:EaudnRltt4RVJo3fRVn-1A@comcast.com...
>
>
> Buck Turgidson wrote:
>
> You can read up on the "in" operator at
> http://www.gnu.org/software/gawk/ma...ce-to-Elements.
> It is absolutely crucial to know about the subtleties of "in" (e.g. it
> tests for but doesn't create an array index, unlike a["str"]=="") to do
> awk programming.




Did you mean to use the double equals above in your comment (unlike
a["str"]==""), or is that just a typo?


Kenny McCormack

2005-02-14, 3:55 pm

In article <dsu6e2-lh4.ln1@turf.turgidson.com>,
Buck Turgidson <jc_va@hotmail.com> wrote:
....
>
>
>
>Did you mean to use the double equals above in your comment (unlike
>a["str"]=="")), or is that just a typo?


Yes, he does. No, it is not a typo.

Kenny McCormack

2005-02-14, 3:55 pm

In article <4210c8b3$1@news.012.net.il>,
Aharon Robbins <arnold@skeeve.com> wrote:
....
>It's not that out-of-date. Very very little has been added to the language
>since that book was published. It's worth getting (a) because it's in
>a nicer format, and (b) because it puts a few $$ in my pocket.


(I assume the book we are talking about is EAP. If not, please disregard
these comments)

It's a good book. There are obscurities of the language contained in the
book that you will never, ever get from reading documentation in electronic
form (it just being a fact of life that whenever you read documentation in
electronic form, you are always in "scan" mode).

This fact (that physical documentation on paper allows you to absorb things
that you can never get from browsing it in electronic form), is far more
important than any incremental changes that have occurred over the last
3 years.

In fact, besides "switch", what is there?

Ed Morton

2005-02-14, 3:55 pm



Buck Turgidson wrote:

> "Ed Morton" <morton@lsupcaemnt.com> wrote in message
> news:EaudnRltt4RVJo3fRVn-1A@comcast.com...
>
>
>
>
>
>
> Did you mean to use the double equals above in your comment (unlike
> a["str"]=="")), or is that just a typo?
>


Yes, I did mean to use it. This:

if ("str" in a) print

will only print if a["str"] exists, whereas this:

if (a["str"] != "") print

will only print if a["str"] is populated with something other than an
empty string but it additionally will add an entry to "a" indexed by
"str" e.g.:

PS1> gawk 'BEGIN{if ("str" in a) print; for (i in a) print i; exit}'
PS1> gawk 'BEGIN{if (a["str"] != "") print; for (i in a) print i; exit}'
str

Regards,

Ed.
Buck Turgidson

2005-02-14, 3:55 pm

> >
>
> Yes, he does. No, it is not a typo.
>


I asked because it differs from his original code. I thought he was
referencing the `a[$0]=""` part below where he uses the assignment operator.


gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file


Kenny McCormack

2005-02-14, 3:55 pm

In article <9vv6e2-gj4.ln1@turf.turgidson.com>,
Buck Turgidson <jc_va@hotmail.com> wrote:
>
>I asked because it differs from his original code. I thought he was
>referencing the `a[$0]=""` part below where he uses the assignment operator.
>
>
>gawk '/SECTION/{delete a}!($0 in a){print;a[$0]=""}' file


Note, incidentally, that the more idiomatic way to express this is:

a[$0]++

(invoking {print} via the default action)

Buck Turgidson

2005-02-14, 3:55 pm

Thanks for your patience in educating me.


Jürgen Kahrs

2005-02-14, 3:55 pm

Ed Morton wrote:

> Do you have a small document on-line somewhere that just describes the
> updates since the book was published? I'd absolutely recommend people
> buy the book if that document of deltas was also available, as I
> appreciate all the work that you've put into the tool and its
> documentation.


There is no official delta-doc for the book AFAIK.
But the NEWS file from the source distribution could
be interesting for you. I have appended the relevant
part.

Ed Morton

2005-02-14, 3:55 pm



Kenny McCormack wrote:

> In article <9vv6e2-gj4.ln1@turf.turgidson.com>,
> Buck Turgidson <jc_va@hotmail.com> wrote:
>
>
>
> Note, incidentally, that the more idiomatic way to express this is:
>
> a[$0]++
>
> (invoking {print} via the default action)


Yup, you're right. This is the right way to do it:

gawk '/SECTION/{delete a}!a[$0]++' file

Thanks for catching it,

Ed.
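
For readers wondering why !a[$0]++ works: the first time a line is seen, a[$0] is unset and evaluates to 0, the post-increment returns that old value, and !0 is true, so the default action (print) runs. On every later occurrence the old value is non-zero, so the pattern is false and the line is skipped. A small sketch run on Jonny's sample data is below; the function name dedupe_sections is made up for illustration, and plain awk is used in place of gawk on the assumption that the local awk accepts "delete a" for a whole array.

```sh
dedupe_sections() {
    # Reset the array at each section header, then print only the
    # first occurrence of each line within the current section.
    awk '/SECTION/ { delete a }  !a[$0]++' "$@"
}

# Sample data from the start of the thread.
printf '%s\n' SECTION1 A B C A A D \
              SECTION2 A B C B D \
              SECTION3 E E C A E | dedupe_sections
```

On an awk that does not accept "delete a" without a subscript, split("", a) empties the array in the same way.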
Aharon Robbins

2005-02-15, 8:55 am

In article <7oedncumGuO0UI3fRVn-2w@comcast.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>The version Jurgen referred to is dated May 2001, whereas the on-line
>one is June 2004 which is just over 3 years. Is there a newer release of
>the book?


No. It's not necessary right now.

> Very very little has been added to the language
>
>Do you have a small document on-line somewhere that just describes the
>updates since the book was published? I'd absolutely recommend people
>buy the book if that document of deltas was also available, as I
>appreciate all the work that you've put into the tool and its
>documentation.


There's the configure time code for switch (which I'd forgotten about),
and I think asorti(). As you saw in the NEWS file, it's mostly bug
fixes / performance tuning / POSIX compliance. Hmm, the ' flag to
printf showed up in 3.1.4. I'd be surprised if there are more than 5
in-the-language changes, and they're all very small things.

Arnold
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 206 350 8765
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
Copyright 2008 codecomments.com