For Programmers: Free Programming Magazines  


Home > Archive > AWK > January 2008 > Submitted for your review: random sampling filters in gawk









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Submitted for your review: random sampling filters in gawk
steven.huwig@gmail.com

2008-01-03, 3:58 am

After becoming a little curious about random sampling and how to do it
without knowing the number of records in advance, I found Waterman's
Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z
at <http://doi.acm.org/10.1145/3147.3165>.

The nice features of these algorithms are that they take space
proportional to the sample size, and that they work without scanning
the entire file to determine how many records are present. I think
they are appropriate to be implemented as Unix stdin/stdout pipeline
filters.

I wrote both versions up in gawk and figured I'd post them somewhere
where people can poke holes in them, use them, or change them if they
like.

I know that there are a lot of useless comparisons because I used
patterns instead of explicit loops and getline for flow control, but
it's just so darn readable for me this way. :-)

On the other hand, the variable names are not readable. I took them
directly from the algorithm pseudocode to make checking the
implementation a bit easier.

My implementation preserves the input order when it outputs the random
sample. I think this behavior is more useful in shell, where you work
with things like uniq or join that require the inputs to have some
sort of order.

Here's a comparison timing on my system.

scully:~/bin steve$ time bzcat words.bz2 | vitter-sample.awk -v
n=10000 > /dev/null

real 0m54.739s
user 0m28.725s
sys 0m10.722s
scully:~/bin steve$ time bzcat words.bz2 | waterman-sample.awk -v
n=10000 > /dev/null

real 1m35.501s
user 0m50.242s
sys 0m19.479s
scully:~/bin steve$ bzcat words.bz2 | wc -l
11746850

Comments are welcome.

-- Steve


# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}

NR <= n {
pool[NR] = $0
places[NR] = NR
next
}

NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}



# Vitter's Algorithm Z for random sampling
# http://doi.acm.org/10.1145/3147.3165 #

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
T = 10 # tuneable per (Vitter 85)
srand()
nplusone = n + 1
}

NR == nplusone {
pool[NR] = $0
places[NR - 1] = NR
t = n
thresh = T*n
num = 0
}

NR > n && t > thresh {
if (!W) {
W = exp(-log(rand())/n)
term = t - n + 1
}
while (1) {
U = rand()
X = t*(W - 1.0)
S = int(X)
lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
rhs = (((t + X)/(term + S))*term)/t
if (lhs <= rhs) {
W = rhs/lhs
break
}
y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
if (n < S) {
denom = t
numer_lim = term + S
} else {
denom = t - n + S
numer_lim = t + 1
}
for (numer = t + S; numer > numer_lim; numer--) {
y = (y*numer)/denom
denom--
}
W = exp(-log(rand())/n)
if (exp(log(y)/n) <= (t + X)/t)
break
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n*rand()))
else
exit
t = t + S + 1
term = term + S + 1
next
}

NR > n && t <= thresh {
V = rand()
S = 0
t++
num++
quot = num/t
while (quot > V) {
S++
t++
num++
quot = (quot*num)/t
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n * rand()))
else
exit
next
}

NR <= n {
pool[NR] = $0
places[NR - 1] = NR
next
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}

function SKIP_RECORDS(skip) {
retval = 1
for (i = 0; i < skip; i++)
retval = getline
return retval > 0
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}

steven.huwig@gmail.com

2008-01-03, 3:58 am

I didn't use the Google Groups form correctly and the posted code was
not valid. Hopefully the below comes out right.


# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}

NR <= n {
pool[NR] = $0
places[NR] = NR
next
}

NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"

exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
print newpool[ordered[i]]
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}


# Vitter's Algorithm Z for random sampling
# http://doi.acm.org/10.1145/3147.3165 #

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
T = 10 # tuneable per (Vitter 85)
srand()
nplusone = n + 1
}

NR == nplusone {
pool[NR] = $0
places[NR - 1] = NR
t = n
thresh = T*n
num = 0
}

NR > n && t > thresh {
if (!W) {
W = exp(-log(rand())/n)
term = t - n + 1
}
while (1) {
U = rand()
X = t*(W - 1.0)
S = int(X)
lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
rhs = (((t + X)/(term + S))*term)/t
if (lhs <= rhs) {
W = rhs/lhs
break
}
y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
if (n < S) {
denom = t
numer_lim = term + S
} else {
denom = t - n + S
numer_lim = t + 1
}
for (numer = t + S; numer > numer_lim; numer--) {
y = (y*numer)/denom
denom--
}
W = exp(-log(rand())/n)
if (exp(log(y)/n) <= (t + X)/t)
break
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n*rand()))
else
exit
t = t + S + 1
term = term + S + 1
next
}

NR > n && t <= thresh {
V = rand()
S = 0
t++
num++
quot = num/t
while (quot > V) {
S++
t++
num++
quot = (quot*num)/t
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n * rand()))
else
exit
next
}

NR <= n {
pool[NR] = $0
places[NR - 1] = NR
next
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"

exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
print newpool[ordered[i]]
}

function SKIP_RECORDS(skip) {
retval = 1
for (i = 0; i < skip; i++)
retval = getline
return retval > 0
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}

William James

2008-01-03, 3:58 am

On Jan 2, 11:00 pm, "steven.hu...@gmail.com" <steven.hu...@gmail.com>
wrote:
> I didn't use the Google Groups form correctly and the posted code was
> not valid. Hopefully the below comes out right.
>
> # Waterman's Algorithm R for random sampling
> # by way of Knuth's The Art of Computer Programming, volume 2
>
> BEGIN {
> if (!n) {


Suggestion. Indent 2 spaces instead of 4.

[...]

> if (M <= n) {
> READ_NEXT_RECORD(M)
> }


This could be changed to
if (M <= n)
READ_NEXT_RECORD(M)

[...]

> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]


This could be changed to

asorti(newpool, ordered)
for (i = 1; i in ordered; i++)
print newpool[ordered[i]]

Luuk

2008-01-03, 3:58 am


<steven.huwig@gmail.com> schreef in bericht
news:f31214a5-f659-4d9a-8133-a6f2f89b4d76@m34g2000hsf.googlegroups.com...
>I didn't use the Google Groups form correctly and the posted code was
> not valid. Hopefully the below comes out right.
>
>
> # Waterman's Algorithm R for random sampling
> # by way of Knuth's The Art of Computer Programming, volume 2
>
> BEGIN {
> if (!n) {
> print "Usage: sample.awk -v n=[size]"
> exit
> }
> t = n
> srand()
> }
>
> NR <= n {
> pool[NR] = $0
> places[NR] = NR
> next
> }
>
> NR > n {
> t++
> M = int(rand()*t) + 1
> if (M <= n) {
> READ_NEXT_RECORD(M)
> }
> }
>
> END {
> if (NR < n) {
> print "sample.awk: Not enough records for sample" \
> exit
> }
> # gawk needs a numeric sort function
> # since it doesn't have one, zero-pad and sort alphabetically
> pad = length(NR)
> for (i in pool) {
> new_index = sprintf("%0" pad "d", i)
> newpool[new_index] = pool[i]
> }
> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]
> }
>
> function READ_NEXT_RECORD(idx) {
> rec = places[idx]
> delete pool[rec]
> pool[NR] = $0
> places[idx] = NR
> }
>
>
> # Vitter's Algorithm Z for random sampling
> # http://doi.acm.org/10.1145/3147.3165 #
>
> BEGIN {
> if (!n) {
> print "Usage: sample.awk -v n=[size]"
> exit
> }
> T = 10 # tuneable per (Vitter 85)
> srand()
> nplusone = n + 1
> }
>
> NR == nplusone {
> pool[NR] = $0
> places[NR - 1] = NR
> t = n
> thresh = T*n
> num = 0
> }
>
> NR > n && t > thresh {
> if (!W) {
> W = exp(-log(rand())/n)
> term = t - n + 1
> }
> while (1) {
> U = rand()
> X = t*(W - 1.0)
> S = int(X)
> lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
> rhs = (((t + X)/(term + S))*term)/t
> if (lhs <= rhs) {
> W = rhs/lhs
> break
> }
> y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
> if (n < S) {
> denom = t
> numer_lim = term + S
> } else {
> denom = t - n + S
> numer_lim = t + 1
> }
> for (numer = t + S; numer > numer_lim; numer--) {
> y = (y*numer)/denom
> denom--
> }
> W = exp(-log(rand())/n)
> if (exp(log(y)/n) <= (t + X)/t)
> break
> }
> if (SKIP_RECORDS(S))
> READ_NEXT_RECORD(int(n*rand()))
> else
> exit
> t = t + S + 1
> term = term + S + 1
> next
> }
>
> NR > n && t <= thresh {
> V = rand()
> S = 0
> t++
> num++
> quot = num/t
> while (quot > V) {
> S++
> t++
> num++
> quot = (quot*num)/t
> }
> if (SKIP_RECORDS(S))
> READ_NEXT_RECORD(int(n * rand()))
> else
> exit
> next
> }
>
> NR <= n {
> pool[NR] = $0
> places[NR - 1] = NR
> next
> }
>
> END {
> if (NR < n) {
> print "sample.awk: Not enough records for sample" \
> exit
> }
> # gawk needs a numeric sort function
> # since it doesn't have one, zero-pad and sort alphabetically
> pad = length(NR)
> for (i in pool) {
> new_index = sprintf("%0" pad "d", i)
> newpool[new_index] = pool[i]
> }
> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]
> }
>
> function SKIP_RECORDS(skip) {
> retval = 1
> for (i = 0; i < skip; i++)
> retval = getline
> return retval > 0
> }
>
> function READ_NEXT_RECORD(idx) {
> rec = places[idx]
> delete pool[rec]
> pool[NR] = $0
> places[idx] = NR
> }
>


i tried something

$ wc -l american
212710 american

$ while true; do awk -v n=1 vitter-sample.awk american ; done
unpersonalized
unpersonalized
kerosene
kerosene
kerosene
kerosene
kerosene
Kiyoshi
Kiyoshi
Kiyoshi
Kiyoshi
Kiyoshi
Navratilova's
Navratilova's
Navratilova's
Navratilova's
Navratilova's
passport's

$ while true; do awk -v n=1 waterman-sample.awk american ; done
physio
physio
physio
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
incineration's
AA
incineration's


At 1st i was amazed by the randomness of this script,
but i think i need an expert on randomness to tell why above is happening...
(but still, thanks for sharing the scripts!)



Janis Papanagnou

2008-01-03, 3:58 am

William James wrote:
> On Jan 2, 11:00 pm, "steven.hu...@gmail.com" <steven.hu...@gmail.com>
> wrote:
>
>
>
>
> This could be changed to
>
> asorti(newpool, ordered)
> for (i = 1; i in ordered; i++)
> print newpool[ordered[i]]
>


What would be the advantage? (Just saving a variable?)

I would assume that the 'in' test operator is of complexity O(log N),
i.e. a balanced tree lookup, so that the for loop would be O(N*log N),
while the simple scalar comparison that the OP used is just plain O(N).

The counted loop seems to be always preferable in case of contiguous
filled arrays.

Janis
steven.huwig@gmail.com

2008-01-03, 7:58 am

On Jan 3, 4:28=A0am, "Luuk" <l...@invalid.lan> wrote:

> At 1st i was amazed by the randomness of this script,
> but i think i need an expert on randomness to tell why above is happening.=

...
> (but still, thanks for sharing the scripts!)


srand() seeds the random number generator with the time in seconds, so
if your computer is fast enough that behavior of repeated selection
could happen. How does it work if you put a sleep 1 in your scripts?

On the other hand there looks to be an edge case with sample size n=3D1
as well. I would not be surprised at all. I can look into it a bit
more and post a fix if I find the bug in my implementation.

I think your script labels may be switched. I am seeing funny behavior
with the Vitter algorithm and sample size n=3D1. The simpler one seems
to be fine.

Thanks!

-- Steve



Luuk

2008-01-03, 7:58 am


<steven.huwig@gmail.com> schreef in bericht
news:0aac594c-3223-4281-9ad9-7e6f233adb2c@l6g2000prm.googlegroups.com...
On Jan 3, 4:28 am, "Luuk" <l...@invalid.lan> wrote:

> At 1st i was amazed by the randomness of this script,
> but i think i need an expert on randomness to tell why above is
> happening...
> (but still, thanks for sharing the scripts!)


srand() seeds the random number generator with the time in seconds, so
if your computer is fast enough that behavior of repeated selection
could happen. How does it work if you put a sleep 1 in your scripts?

On the other hand there looks to be an edge case with sample size n=1
as well. I would not be surprised at all. I can look into it a bit
more and post a fix if I find the bug in my implementation.

I think your script labels may be switched. I am seeing funny behavior
with the Vitter algorithm and sample size n=1. The simpler one seems
to be fine.

Thanks!

-- Steve


you're right, i switched the 2 versions...

you are probably right about the srand(), adding the 'sleep 1' solves the
problem

but the vitter-sample.awk with option 'n=1' sometimes returns 2 rows
'AA' is the second line in the file american

$ while true; do echo "==="; awk -v n=1 -f vitter-sample.awk american ;
sleep 1; done
===
sectionally
===
AA
tangram
===
Lissie's
===
Halvaard
===
Baber's
===
AA
earaches


Anton Treuenfels

2008-01-04, 7:58 am


"Janis Papanagnou" <Janis_Papanagnou@hotmail.com> wrote in message
news:fliad3$qvg$1@online.de...

> The counted loop seems to be always preferable in case of contiguous
> filled arrays.


It certainly is in TAWK V4. Loops of any size execute up to three times
faster while traversing such arrays. Though I had generally assumed that
this was due to the "in" operator triggering an array sort before beginning
the iteration.

- Anton Treuenfels


mjc

2008-01-04, 6:58 pm

On Jan 3, 12:25 am, William James <w_a_x_...@yahoo.com> wrote:

> This could be changed to
>
> asorti(newpool, ordered)
> for (i = 1; i in ordered; i++)
> print newpool[ordered[i]]

I don't know what the for loop as written would do. It might skip some
values, do some more than once, or even get into an infinite loop.

Use: for(i in ordered)
Ed Morton

2008-01-04, 6:58 pm



On 1/4/2008 2:12 PM, mjc wrote:
> On Jan 3, 12:25 am, William James <w_a_x_...@yahoo.com> wrote:
>
>
>
> I don't know what the for loop as written would do. It might skip some
> values, do some more than once, or even get into an infinite loop.


No, it wouldn't do any of that. It'd just loop through all the values in "ordered".

> Use: for(i in ordered)


That wouldn't preserve the order.

Ed.

Ed Morton

2008-01-04, 6:58 pm



On 1/2/2008 11:00 PM, steven.huwig@gmail.com wrote:
> I didn't use the Google Groups form correctly and the posted code was
> not valid. Hopefully the below comes out right.
>
>
> # Waterman's Algorithm R for random sampling
> # by way of Knuth's The Art of Computer Programming, volume 2
>
> BEGIN {
> if (!n) {
> print "Usage: sample.awk -v n=[size]"


Is zero a valid size? What about "seven"? You could tweak the above test to be a
bit more robust, e.g.:

if (!(n+0))

> exit
> }
> t = n
> srand()
> }
>
> NR <= n {
> pool[NR] = $0
> places[NR] = NR
> next


No need for a "next" above or no need for the opposite condition "NR>n" below.

> }
>
> NR > n {
> t++
> M = int(rand()*t) + 1
> if (M <= n) {
> READ_NEXT_RECORD(M)
> }
> }
>
> END {
> if (NR < n) {
> print "sample.awk: Not enough records for sample" \
> exit
> }
> # gawk needs a numeric sort function
> # since it doesn't have one, zero-pad and sort alphabetically
> pad = length(NR)


I may be having a dense moment, but I don't understand why you'd want to pad
each record to the size of the number of records in the file. Wouldn't you want
to instead pad to the length of the longest record?

Ed.

> for (i in pool) {
> new_index = sprintf("%0" pad "d", i)
> newpool[new_index] = pool[i]
> }
> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]
> }
>
> function READ_NEXT_RECORD(idx) {
> rec = places[idx]
> delete pool[rec]
> pool[NR] = $0
> places[idx] = NR
> }
>
>
> # Vitter's Algorithm Z for random sampling
> # http://doi.acm.org/10.1145/3147.3165 #
>
> BEGIN {
> if (!n) {
> print "Usage: sample.awk -v n=[size]"
> exit
> }
> T = 10 # tuneable per (Vitter 85)
> srand()
> nplusone = n + 1
> }
>
> NR == nplusone {
> pool[NR] = $0
> places[NR - 1] = NR
> t = n
> thresh = T*n
> num = 0
> }
>
> NR > n && t > thresh {
> if (!W) {
> W = exp(-log(rand())/n)
> term = t - n + 1
> }
> while (1) {
> U = rand()
> X = t*(W - 1.0)
> S = int(X)
> lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
> rhs = (((t + X)/(term + S))*term)/t
> if (lhs <= rhs) {
> W = rhs/lhs
> break
> }
> y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
> if (n < S) {
> denom = t
> numer_lim = term + S
> } else {
> denom = t - n + S
> numer_lim = t + 1
> }
> for (numer = t + S; numer > numer_lim; numer--) {
> y = (y*numer)/denom
> denom--
> }
> W = exp(-log(rand())/n)
> if (exp(log(y)/n) <= (t + X)/t)
> break
> }
> if (SKIP_RECORDS(S))
> READ_NEXT_RECORD(int(n*rand()))
> else
> exit
> t = t + S + 1
> term = term + S + 1
> next
> }
>
> NR > n && t <= thresh {
> V = rand()
> S = 0
> t++
> num++
> quot = num/t
> while (quot > V) {
> S++
> t++
> num++
> quot = (quot*num)/t
> }
> if (SKIP_RECORDS(S))
> READ_NEXT_RECORD(int(n * rand()))
> else
> exit
> next
> }
>
> NR <= n {
> pool[NR] = $0
> places[NR - 1] = NR
> next
> }
>
> END {
> if (NR < n) {
> print "sample.awk: Not enough records for sample" \
> exit
> }
> # gawk needs a numeric sort function
> # since it doesn't have one, zero-pad and sort alphabetically
> pad = length(NR)
> for (i in pool) {
> new_index = sprintf("%0" pad "d", i)
> newpool[new_index] = pool[i]
> }
> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]
> }
>
> function SKIP_RECORDS(skip) {
> retval = 1
> for (i = 0; i < skip; i++)
> retval = getline
> return retval > 0
> }
>
> function READ_NEXT_RECORD(idx) {
> rec = places[idx]
> delete pool[rec]
> pool[NR] = $0
> places[idx] = NR
> }
>


Kenny McCormack

2008-01-04, 6:58 pm

In article <0fa66663-05e3-430b-802d-caaa097340dc@y5g2000hsf.googlegroups.com>,
mjc <mjcohen@acm.org> wrote:
>On Jan 3, 12:25 am, William James <w_a_x_...@yahoo.com> wrote:
>
>I don't know what the for loop as written would do. It might skip some
>values, do some more than once, or even get into an infinite loop.
>
>Use: for(i in ordered)


As Ed notes, the above is perfectly well-defined, on the assumption that
the array indexes are known to be (exclusively) numeric and to start at 1
(and not be infinite).

Still, the above is odd syntax and definitely "looks wrong". But, a
clever workaround for when you are using a defective AWK that doesn't do
array sorting.

steven.huwig@gmail.com

2008-01-06, 3:58 am

On Jan 4, 4:45=A0pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Is zero a valid size? What about "seven"? You could tweak the above test t=

o be a
> bit more robust, e.g.:
>
> =A0 =A0 =A0 =A0 if (!(n+0))


Good idea -- thanks!

>
> I may be having a dense moment, but I don't understand why you'd want to p=

ad
> each record to the size of the number of records in the file. Wouldn't you=

want
> to instead pad to the length of the longest record?


I'm padding/sorting the indices, not the input data. I want to output
the data in the same order it was received.

pad =3D length(NR)
for (i in pool) {
new_index =3D sprintf("%0" pad "d", i)
newpool[new_index] =3D pool[i]
}
x =3D asorti(newpool, ordered)
for (i =3D 1; i <=3D x; i++)
print newpool[ordered[i]]

The only way I could see to order it within gawk alone was to copy the
original "pool" array to "newpool", which uses the indices from "pool"
converted to a format that alphabetic sort would work on.

Haven't yet had time to look at the strange behavior Luuk saw... maybe
tomorrow.

-- Steve
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com