For Programmers: Free Programming Magazines  


Home > Archive > AWK > December 2006 > First Lines and Last Lines of a File









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author First Lines and Last Lines of a File
Antonio Dell'elce

2006-12-04, 7:56 am

Hi all,

when i need the first N lines of a file I use awk this way:

cat file | c=3 awk ' BEGIN { c=ENVIRON["c"] } NR <= c { print } '

(just an example)

but can I do when I need the n first lines of a file and the last n
lines of a file ?
I tought of keeping n lines in an array and remove the older line after
every new,
pratically a FIFO structure... but it begins to be a bit difficult...
anyone with
a quicker/easier one? (I know I could use a combination of "tail" &
"head".. but
I'd like to keep processing inside a single awk "statement".


Thanks,

Antonio

Thomas Weidenfeller

2006-12-04, 7:56 am

Antonio Dell'elce wrote:
> Hi all,
>
> when i need the first N lines of a file I use awk this way:
>
> cat file | c=3 awk ' BEGIN { c=ENVIRON["c"] } NR <= c { print } '


That one can be beautified, but lets leave it as it is. You want to have
a look at the -v assignment command line option, and avoid the useless
use of cat.

>
> (just an example)
>
> but can I do when I need the n first lines of a file and the last n
> lines of a file ?


For the last N lines:

{ lines[n++ % N] = $0 }
END {
for(i = n; i < n + N; i++) {
j = i % N
if(j in lines) {
print lines[j]
}
}
}


Note, the /if(j in lines)/ test is only there to handle the case when
the total number of input lines is < N. If that is guaranteed not to
happen you can simplify the loop.

> I tought of keeping n lines in an array and remove the older line after
> every new,
> pratically a FIFO structure... but it begins to be a bit difficult...


The above uses a ring buffer of N lines, just overwriting the oldest
line when it turns out there is another line available.

/Thomas
Ed Morton

2006-12-04, 9:56 pm

Thomas Weidenfeller wrote:

> Antonio Dell'elce wrote:
>
>
>
> That one can be beautified, but lets leave it as it is.


I tried but I can't - somebody help me....

> You want to have
> a look at the -v assignment command line option, and avoid the useless
> use of cat.


Right:

awk -v c="$c" 'NR<=c' file

>
>
> For the last N lines:
>
> { lines[n++ % N] = $0 }
> END {
> for(i = n; i < n + N; i++) {
> j = i % N
> if(j in lines) {
> print lines[j]
> }
> }
> }
>
>
> Note, the /if(j in lines)/ test is only there to handle the case when
> the total number of input lines is < N. If that is guaranteed not to
> happen you can simplify the loop.


Nice. Alternatively, parse the file twice:

awk 'NR==FNR{nr++}FNR>(nr-N)' file file

Regards,

Ed.
Antonio Dell'elce

2006-12-05, 7:57 am

Ed Morton wrote:
> Thomas Weidenfeller wrote:
>
>
> I tried but I can't - somebody help me....
>
>
> Right:
>
> awk -v c="$c" 'NR<=c' file
>
>
> Nice. Alternatively, parse the file twice:
>
> awk 'NR==FNR{nr++}FNR>(nr-N)' file file
>
> Regards,
>
> Ed.


I prefer the modulo version, parsing twice would not be good
performance-wise.

Thanks,
Antonio

Ed Morton

2006-12-05, 7:57 am

Antonio Dell'elce wrote:
> Ed Morton wrote:
>
>
>
> I prefer the modulo version, parsing twice would not be good
> performance-wise.


I haven't tested it, but parsing it twice will use less memory and I'd
expect it to be faster since array operations are typically slow. If you
care about tuning the performance, you could add ";next" after the
"nr++" and that MIGHT be faster still.

Ed.

Janis

2006-12-05, 7:57 am

Ed Morton wrote:
> Antonio Dell'elce wrote:

The two-pass solution is at least clearer and easier to comprehend,
and I'd say also less error prone if you want to extend it.
[color=darkred]
>
> I haven't tested it, but parsing it twice will use less memory and I'd
> expect it to be faster since array operations are typically slow. If you
> care about tuning the performance, you could add ";next" after the
> "nr++" and that MIGHT be faster still.


Hmm, I don't think that file I/O, which is usually slow compared to
in-memory operations (even if in the second pass the file is buffered
in some cache memory) would be faster.

Things might get worse if your values of N get really *huge* and array
operations quite costly; but that seems mostly not to be the case with
this type of problem, where one often just wants to strip just a couple
of lines.

Some benchmark numbers would be interesting.

Janis

>
> Ed.


Ed Morton

2006-12-05, 6:56 pm

Janis wrote:

> Ed Morton wrote:
>
<snip>[color=darkred]
>
>
> The two-pass solution is at least clearer and easier to comprehend,
> and I'd say also less error prone if you want to extend it.
>
>
>
>
> Hmm, I don't think that file I/O, which is usually slow compared to
> in-memory operations (even if in the second pass the file is buffered
> in some cache memory) would be faster.


I used to think I/O was slow too. I don't recall who it was or what the
reasoning was, but I remember someone who I thought knew what they were
talking about convincing me that multiple "print"s is faster than
building up a string and then printing it once so that changed my opinion.

> Things might get worse if your values of N get really *huge* and array
> operations quite costly; but that seems mostly not to be the case with
> this type of problem, where one often just wants to strip just a couple
> of lines.
>
> Some benchmark numbers would be interesting.


Here y'go (gawk 3.1.5 on Cygwin in Windows XP):

$ cat tst1
{ lines[n++ % N] = $0 }
END {
for(i = n; i < n + N; i++) {
j = i % N
if(j in lines) {
print lines[j]
}
}
}
$ wc -l file1*k
100000 file100k
10000 file10k
1000 file1k
111000 total
$ time awk -v N=20 -f tst1 file100k > /dev/null

real 0m0.742s
user 0m0.702s
sys 0m0.062s
$ time awk -v N=20 'NR==FNR{nr++;next}FNR>(nr-N)' file100k file100k >
/dev/null

real 0m0.368s
user 0m0.327s
sys 0m0.078s
$ time awk -v N=20 -f tst1 file10k > /dev/null

real 0m0.127s
user 0m0.140s
sys 0m0.015s
$ time awk -v N=20 'NR==FNR{nr++;next}FNR>(nr-N)' file10k file10k >
/dev/null

real 0m0.088s
user 0m0.077s
sys 0m0.047s
$ time awk -v N=20 -f tst1 file1k > /dev/null

real 0m0.102s
user 0m0.077s
sys 0m0.030s
$ time awk -v N=20 'NR==FNR{nr++;next}FNR>(nr-N)' file1k file1k > /dev/null

real 0m0.068s
user 0m0.046s
sys 0m0.030s
$ time awk -v N=2 -f tst1 file100k > /dev/null

real 0m0.661s
user 0m0.640s
sys 0m0.046s
$ time awk -v N=2 'NR==FNR{nr++;next}FNR>(nr-N)' file100k file100k >
/dev/null

real 0m0.369s
user 0m0.311s
sys 0m0.077s

Each line of each input file is just the line number.

Regards,

Ed.
Antonio Dell'elce

2006-12-05, 6:56 pm


Ed Morton wrote:[color=darkred]
> Janis wrote:
>
> <snip>
>
> I used to think I/O was slow too. I don't recall who it was or what the
> reasoning was, but I remember someone who I thought knew what they were
> talking about convincing me that multiple "print"s is faster than
> building up a string and then printing it once so that changed my opinion.
>


I believe that when you read multiple time the same files you take
advantage of the
O.S. buffercache, so when you'll read the first time it will be a bit
slower the other times
will be a bit faster.... it will grow as faster as possible depending
on the availability
of the buffer cache.

In a system under no stress with plenty of memory the reading a file
multiple times
might be reasonable but in enterprise systems with high load it may be
less acceptable.

You can test this disabling buffer cache on your O.S. (which is
possible on most unix variants).

Antonio

Janis Papanagnou

2006-12-05, 6:56 pm

Ed Morton wrote:
> Janis wrote:
>
>
> <snip>
>
>
>
> I used to think I/O was slow too. I don't recall who it was or what the
> reasoning was, but I remember someone who I thought knew what they were
> talking about convincing me that multiple "print"s is faster than
> building up a string and then printing it once so that changed my opinion.


That's not quite the same. Building strings depends on the string library
implementation. Multiple prints depend in most cases on the buffered I/O
library.

>
>
>
> Here y'go (gawk 3.1.5 on Cygwin in Windows XP):


Thanks for the numbers.

Did you take care in the tests that you either have disabled buffering
or called the respective file a couple of times before starting with the
programs? Otherwise the first call will produce larger harddisk access
times while the subsequent files will access the cache. (The first method
of disabling the cache gives the values most relevant in practice, the
second one is just a cludge.)

Janis

>
> $ cat tst1
> { lines[n++ % N] = $0 }
> END {
> for(i = n; i < n + N; i++) {
> j = i % N
> if(j in lines) {
> print lines[j]
> }
> }
> }
> $ wc -l file1*k
> 100000 file100k
> 10000 file10k
> 1000 file1k
> 111000 total
> $ time awk -v N=20 -f tst1 file100k > /dev/null
>
> real 0m0.742s
> user 0m0.702s
> sys 0m0.062s
> $ time awk -v N=20 'NR==FNR{nr++;next}FNR>(nr-N)' file100k file100k >
> /dev/null
>
> real 0m0.368s
> user 0m0.327s
> sys 0m0.078s
>[snip]
> Each line of each input file is just the line number.
>
> Regards,
>
> Ed.

Vassilis

2006-12-05, 6:56 pm


=CF/=C7 Janis Papanagnou =DD=E3=F1=E1=F8=E5:[color=darkred]
> Ed Morton wrote:
>
> Thanks for the numbers.
>
> Did you take care in the tests that you either have disabled buffering
> or called the respective file a couple of times before starting with the
> programs? Otherwise the first call will produce larger harddisk access
> times while the subsequent files will access the cache. (The first method
> of disabling the cache gives the values most relevant in practice, the
> second one is just a cludge.)
>
> Janis
>
0k >[color=darkred]

I can confirm Ed's numbers. I've run these two scripts on my windos
box, 20 times each, on a file of 1000k lines (my cache2 is 128 Kb).

hash.awk 3.34455 (mean real time in seconds)
2read.awk 2.4961 (mean real time in seconds)

Antonio Dell'elce

2006-12-05, 6:56 pm


Vassilis wrote:
> =CF/=C7 Janis Papanagnou =DD=E3=F1=E1=F8=E5:
ay[color=darkred]
th[color=darkred]
ple[color=darkred]
od[color=darkred]
100k >[color=darkred]
>
> I can confirm Ed's numbers. I've run these two scripts on my windos
> box, 20 times each, on a file of 1000k lines (my cache2 is 128 Kb).
>
> hash.awk 3.34455 (mean real time in seconds)
> 2read.awk 2.4961 (mean real time in seconds)


what do you mean by "cache2 is 128k"? As far as my knowledge goes
windows does
not support disabling buffers on filesystems and does not let fiddle
with how it works
if you mean processor's L2 cache I don't believe it affects test of
most I/O related tests.

Antonio

Vassilis

2006-12-05, 6:56 pm


=CF/=C7 Antonio Dell'elce =DD=E3=F1=E1=F8=E5:
> Vassilis wrote:
>
> what do you mean by "cache2 is 128k"? As far as my knowledge goes
> windows does
> not support disabling buffers on filesystems and does not let fiddle
> with how it works
> if you mean processor's L2 cache I don't believe it affects test of
> most I/O related tests.
>
> Antonio


Yes, I meant L2 cache, sorry.
I was thinking about program size + data passing through cache, but
didn't see the disk :(
You're right, cache /might/ affect times, in case program and data are
small enough.
Ok, now, someone must try this on linux.

Ed Morton

2006-12-05, 6:56 pm

Janis Papanagnou wrote:
<snip>
> Did you take care in the tests that you either have disabled buffering
> or called the respective file a couple of times before starting with the
> programs?


No, I just threw something together. If anyone feels like being more
thorough, I'll be slightly interested in the results....

Ed.
Janis Papanagnou

2006-12-05, 9:56 pm

Ed Morton wrote:
> Janis Papanagnou wrote:
> <snip>
>
>
>
> No, I just threw something together. If anyone feels like being more
> thorough, I'll be slightly interested in the results....
>
> Ed.


I used the method to restart the programs a couple times so that the data
file will be in cache for both programs, and then run them alternately. I
increased the size of the data file to 1000k lines and varied the value of
N to see how much the array size increases resp. reduces the performance.

The results (on my old Linux 2.2 box) are quite as I expected; the "array
solution" is ~15% faster than the "two-pass solution" for the smaller value
of N=20 and it's ~10% slower for a much larger value of N=2000. They are on
par at N=200. For N=2 the "array solution" is even ~30% faster.

The "two-pass solution" time is independent from the value of N, as expected,
while the "array solution" time increases with N.

Janis


$ time awk -v N=2 -f array.awk file1000k >/dev/null

real 0m1,41s
user 0m1,37s
sys 0m0,00s


$ time awk -v N=20 -f array.awk file1000k >/dev/null

real 0m1,72s
user 0m1,69s
sys 0m0,00s

$ time awk -v N=20 -f twopass.awk file1000k file1000k >/dev/null

real 0m2,00s
user 0m1,96s
sys 0m0,00s


$ time awk -v N=2000 -f array.awk file1000k >/dev/null

real 0m2,19s
user 0m2,15s
sys 0m0,00s

$ time awk -v N=2000 -f twopass.awk file1000k file1000k >/dev/null

real 0m2,00s
user 0m1,96s
sys 0m0,00s

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com