Yet another very basic sorting question

Harriet Bazley

2006-11-22, 9:55 pm

I'm sure I'm being extremely stupid, but I have searched both Google's
archives of this newsgroup and the text version of the gawk manual, and
can't find a reference to what seems a very simple problem, that I've
been trying to solve for hours. I've tried copying arrays into arrays,
I've tried copying arrays to reverse the keys and values, I've tried
creating new indices from 1 to n, but I can't seem to get the result I
want....


I can use a very simple program to count the number of occurrences of a
given input line and print the number of unique lines and how often each
occurs:

{lines[$0]++}

END{
for (i in lines)
print lines[1],i
}


But what I can't see how to do is use asort() or asorti() to print out
these lines in order of frequency, given that the array values are *not*
guaranteed to be unique. So far as I can see it needs at least three
arrays, but I've tried all sorts of permutations and can't find out how
to retain *both* key and index values and output both in the order of
index magnitude.

In the end I had to use
printf("%.3d %s\n",lines[i],i)
and then use my editor's block sort to get the results in order, and its
wildcard search to remove the excess zeroes again, which is not exactly
an elegant solution. :-(

--
Harriet Bazley == Loyaulte me lie ==

If you're feeling good, don't worry. You'll get over it.

Grant

2006-11-22, 9:55 pm

On Thu, 23 Nov 2006 02:21:11 GMT, Harriet Bazley <bazley@feathermail.co.uk> wrote:

>I'm sure I'm being extremely stupid, but I have searched both Google's
>archives of this newsgroup and the text version of the gawk manual, and
>can't find a reference to what seems a very simple problem, that I've
>been trying to solve for hours. I've tried copying arrays into arrays,
>I've tried copying arrays to reverse the keys and values, I've tried
>creating new indices from 1 to n, but I can't seem to get the result I
>want....
>
>
>I can use a very simple program to count the number of occurrences of a
>given input line and print the number of unique lines and how often each
>occurs:
>
> {lines[$0]++}
>
>END{
> for (i in lines)
> print lines[1],i
> }


Something like:

# Zero-pad the count so that string order matches numeric order
for (i in lines)
    sort_lines[++j] = sprintf("%06d %s", lines[i], i)

n = asort(sort_lines)

for (i = 1; i <= n; i++) {
    split(sort_lines[i], k)
    printf "%6d %s\n", k[1], k[2]
}
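A complete, runnable version of this idea might look like the sketch below. It assumes gawk is installed (asort() is a gawk extension), and the wrapper name count_by_frequency is made up for illustration. It uses substr() rather than split() so that input lines containing spaces are kept whole, and the six-digit padding assumes counts stay below one million.

```shell
# Hypothetical wrapper; assumes gawk is on PATH, since asort() is a gawk extension.
count_by_frequency() {
    gawk '
    { lines[$0]++ }                      # count each distinct input line

    END {
        # Decorate: zero-padded count first, so string order equals numeric order.
        for (i in lines)
            sort_lines[++j] = sprintf("%06d %s", lines[i], i)

        n = asort(sort_lines)            # ascending sort of the values

        # Undecorate: columns 1-6 are the count, the text starts at column 8.
        for (i = 1; i <= n; i++)
            printf "%6d %s\n", substr(sort_lines[i], 1, 6), substr(sort_lines[i], 8)
    }' "$@"
}
```

Called as `count_by_frequency file` (or with input on stdin), it prints the least frequent lines first; running the final loop from n down to 1 would reverse that.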

Grant.
--
http://bugsplatter.mine.nu/

Brian Inglis

2006-11-23, 3:56 am

On Thu, 23 Nov 2006 02:21:11 GMT in comp.lang.awk, Harriet Bazley
<bazley@feathermail.co.uk> wrote:

>I'm sure I'm being extremely stupid, but I have searched both Google's
>archives of this newsgroup and the text version of the gawk manual, and
>can't find a reference to what seems a very simple problem, that I've
>been trying to solve for hours. I've tried copying arrays into arrays,
>I've tried copying arrays to reverse the keys and values, I've tried
>creating new indices from 1 to n, but I can't seem to get the result I
>want....
>
>
>I can use a very simple program to count the number of occurrences of a
>given input line and print the number of unique lines and how often each
>occurs:
>
> {lines[$0]++}
>
>END{
> for (i in lines)
> print lines[1],i
> }
>
>
>But what I can't see how to do is use asort() or asorti() to print out
>these lines in order of frequency, given that the array values are *not*
>guaranteed to be unique. So far as I can see it needs at least three
>arrays, but I've tried all sorts of permutations and can't find out how
>to retain *both* key and index values and output both in the order of
>index magnitude.
>
>In the end I had to use
> printf("%.3d %s\n",lines[i],i)
>and then use my editor's block sort to get the results in order, and its
>wildcard search to remove the excess zeroes again, which is not exactly
>an elegant solution. :-(


Assuming Unix utilities:
    sort in | uniq -c | sort -k1,1nr
This sorts the file "in", counts the occurrences of each line and outputs
each count followed by its line, then sorts that output in reverse numeric
order of the count.
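As a sketch, the same pipeline can be wrapped in a small shell function. The function name line_frequencies is hypothetical, and the input file is passed as its first argument:

```shell
# Hypothetical helper: count duplicate lines in the named file, most frequent first.
line_frequencies() {
    sort "$1" |          # group identical lines together
        uniq -c |        # collapse each group to "count line"
        sort -k1,1nr     # order by the count field, numerically, descending
}
```

For example, `line_frequencies in` lists each distinct line of the file "in" preceded by its count. The width of the count column is whatever uniq -c produces, which varies between implementations.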

--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada

Brian.Inglis@CSi.com (Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
fake address use address above to reply

Ed Morton

2006-11-23, 6:56 pm

Harriet Bazley wrote:
> I'm sure I'm being extremely stupid, but I have searched both Google's
> archives of this newsgroup and the text version of the gawk manual, and
> can't find a reference to what seems a very simple problem, that I've
> been trying to solve for hours. I've tried copying arrays into arrays,
> I've tried copying arrays to reverse the keys and values, I've tried
> creating new indices from 1 to n, but I can't seem to get the result I
> want....
>
>
> I can use a very simple program to count the number of occurrences of a
> given input line and print the number of unique lines and how often each
> occurs:
>
> {lines[$0]++}
>
> END{
> for (i in lines)
> print lines[1],i


I assume you mean lines[i], not lines[1].

> }
>
>
> But what I can't see how to do is use asort() or asorti() to print out
> these lines in order of frequency, given that the array values are *not*
> guaranteed to be unique. So far as I can see it needs at least three
> arrays, but I've tried all sorts of permutations and can't find out how
> to retain *both* key and index values and output both in the order of
> index magnitude.
>
> In the end I had to use
> printf("%.3d %s\n",lines[i],i)
> and then use my editor's block sort to get the results in order, and its
> wildcard search to remove the excess zeroes again, which is not exactly
> an elegant solution. :-(
>


You may not need more arrays or to sort the array to get the output you
want. Try this:

{lines[$0]++}

END{
    for (line in lines) {
        count = lines[line]
        allLines[count] = allLines[count] sep[count] line OFS count
        sep[count] = ORS
        max = (count > max ? count : max)
    }
    for (i=1; i<=max; i++)
        if (i in allLines)
            print allLines[i]
}

Obviously you can split() allLines[] on ORS before printing, or use a
different separator, or... but hopefully you get the idea.
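A variant of the code above that prints the most frequent lines first only needs the final loop to run downwards. This is a sketch of that variant, not a definitive version: the wrapper name by_frequency_desc and the array name bucket are made up here, and putting the count before the line is a layout choice.

```shell
# Hypothetical wrapper around plain awk; no gawk extensions are used.
by_frequency_desc() {
    awk '
    { lines[$0]++ }                      # count each distinct input line

    END {
        # Bucket the lines by count; lines sharing a count are joined with ORS.
        for (line in lines) {
            count = lines[line]
            bucket[count] = bucket[count] sep[count] count OFS line
            sep[count] = ORS
            if (count > max) max = count
        }
        # Walk the buckets from the highest count down to 1.
        for (i = max; i >= 1; i--)
            if (i in bucket)
                print bucket[i]
    }' "$@"
}
```

Lines that share a count come out in whatever order the for-in loop visits them, so their relative order is unspecified.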

Ed.

Harriet Bazley

2006-11-23, 9:55 pm

On 23 Nov 2006 as I do recall,
Grant wrote:

> On Thu, 23 Nov 2006 02:21:11 GMT, Harriet Bazley
> <bazley@feathermail.co.uk> wrote:


[snip output sort problems]


Typo - sorry!
>
> Something like:
>
> for (i in lines)
> sort_lines[++j] = sprintf("%06d %s", lines[i], i)
>

Ah... beautifully elegant :-D

And the daft thing is that it was more or less what I was doing
manually anyway, and I hadn't thought of it!

Thanks to all.

--
Harriet Bazley == Loyaulte me lie ==

Lies, damned lies and user documentation.