
Author How to unfold files (was: Matrix transposition problem)
Hermann Peifer

2006-11-26, 3:57 am

Hermann wrote:
> I have 1000s of data files with 100s of lines each. Each line has a
> tab-separated data series corresponding to 1 month. The field logic of
> the lines is as follows:
> <startday> <value1> <flag1> <value2> <flag2> (...) <value31> <flag31>
>
> There are always 31 values and 31 flags in a line, even for months with
> less than 31 days. Real data lines look like this:
>
> 2000-01-01 0.2 1 0.7 0 (...) 1.0 -2
> 2000-02-01 0.1 -1 0.9 1 (...) 2.0 -1
> 2000-03-01 0.3 1 1.7 0 (...) 3.0 0
>
>
> I would like to transpose the data matrix in order to have separate
> lines per day, i.e. each input line is transposed into 31 output lines,
> like this:
>
> <startday> <value1> <flag1>
> <startday+1> <value2> <flag2>
> ...
> <startday+30> <value31> <flag31>


My AWK script is ready and seems to do what I want. In addition to the
original problem from above, I also have to cut the FILENAME into pieces
and use the snippets in the output lines. Then I have 2 types of input
lines (daily and hourly data) which I want to convert into a
standard output format. A 3rd type of line (weekly, monthly or yearly
data) has to be ignored.

Thanks again to Janis for helping me.

Before applying the script to some 46000+ files (with a total of ~35
million input lines resulting in ~900 million output lines):

Could someone perhaps have a look and see if the script looks reasonable
(for a beginner) or if there are some (potential) problems?

--snip--
BEGIN { FS = OFS = "\t" }
{
    if (FNR == 1) {
        split(FILENAME, f, ".")                        # split filename
        sc = substr(FILENAME, 1, 7)                    # get station code
        cp = substr(FILENAME, 8, 5)                    # get component number
        ms = substr(FILENAME, 13, 5)                   # get measurement number
        dt = substr(FILENAME, 18, length(f[1]) - 17)   # get data type
    }
    if (NF == 63) {                                    # line with daily data
        d = substr($1, 1, 8)                           # get YYYY-MM- from $1
        for (i = 1; i <= 31; i++)                      # unfold to 31 lines
            print sc, cp, ms, dt, sprintf("%s%02d", d, i), "00:00", $(i*2), $(i*2+1)
    }
    else if (NF == 49) {                               # line with hourly data
        for (i = 1; i <= 24; i++)                      # unfold to 24 lines
            print sc, cp, ms, dt, $1, sprintf("%02d", i) ":00", $(i*2), $(i*2+1)
    }
    else { next }                                      # do nothing (other data)
}
--snip--

In case someone wants to try it out: sample files are available at:
http://cdrtest.eionet.europa.eu/at/...leaoq/envrwlbtw

Thanks in advance, Hermann
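
One potential problem with the daily branch of the script above: it always
prints 31 lines, so February, April, etc. get dates such as 2000-02-31. If
the padded entries for short months are not wanted in the output (an
assumption, the post does not say), a minimal sketch of stopping at the real
month length could look like this. The names unfold_daily, days_in and mdays
are made up for the example, and the station fields taken from FILENAME are
left out to keep it short:

```sh
# Sketch only: unfold daily lines, but stop at the real length of the month.
# unfold_daily, days_in and mdays are hypothetical names for this example.
unfold_daily() {
    awk '
    BEGIN {
        FS = OFS = "\t"
        split("31 28 31 30 31 30 31 31 30 31 30 31", mdays, " ")
    }
    function days_in(y, m) {
        # Gregorian leap year rule for February
        if (m == 2 && (y % 4 == 0 && y % 100 != 0 || y % 400 == 0))
            return 29
        return mdays[m]
    }
    NF == 63 {                                # daily line: date + 31 pairs
        y = substr($1, 1, 4) + 0              # year as a number
        m = substr($1, 6, 2) + 0              # month as a number
        n = days_in(y, m)
        d = substr($1, 1, 8)                  # "YYYY-MM-"
        for (i = 1; i <= n; i++)              # padded days beyond n are skipped
            print sprintf("%s%02d", d, i), "00:00", $(i*2), $(i*2+1)
    }
    ' "$@"
}
```

Usage would be something like "unfold_daily somefile" or piping lines into
unfold_daily. Whether the padded values carry a flag that marks them as
missing is something only the data owner knows, so filtering on that flag
instead may be the better choice.
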
Vassilis

2006-11-26, 7:56 am


Hermann Peifer wrote:
> [...]
> Could someone perhaps have a look and see if the script looks reasonable
> (for a beginner) or if there are some (potential) problems?


The natural way (in awk) is to write:

BEGIN { FS = OFS = "\t" }
FNR == 1 { ... }
NF == 63 { ... }
...
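
Filled in with the bodies from the posted script, the pattern-action layout
might look like the sketch below. The logic is meant to be unchanged; only
the if/else chain is replaced by patterns. The shell wrapper name unfold is
made up for the example, and the substr() offsets still assume that FILENAME
has no directory prefix, so it has to be run inside the data directory:

```sh
# Sketch only: the posted script recast as pattern-action rules.
# "unfold" is a hypothetical wrapper name. Run it in the data directory,
# because the substr() offsets assume FILENAME carries no path prefix.
unfold() {
    awk '
    BEGIN { FS = OFS = "\t" }

    FNR == 1 {                                 # first line of each file
        split(FILENAME, f, ".")
        sc = substr(FILENAME,  1, 7)           # station code
        cp = substr(FILENAME,  8, 5)           # component number
        ms = substr(FILENAME, 13, 5)           # measurement number
        dt = substr(FILENAME, 18, length(f[1]) - 17)    # data type
    }

    NF == 63 {                                 # daily: date + 31 value/flag pairs
        d = substr($1, 1, 8)                   # "YYYY-MM-"
        for (i = 1; i <= 31; i++)
            print sc, cp, ms, dt, sprintf("%s%02d", d, i), "00:00", $(i*2), $(i*2+1)
    }

    NF == 49 {                                 # hourly: date + 24 value/flag pairs
        for (i = 1; i <= 24; i++)
            print sc, cp, ms, dt, $1, sprintf("%02d", i) ":00", $(i*2), $(i*2+1)
    }
    # lines with any other field count match no rule and are skipped
    ' "$@"
}
```

Lines that match none of the patterns produce no output, so the explicit
"else {next}" of the original is not needed.
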

Awk is particularly good at verification too.
Why don't you try something along these lines:

BEGIN { FS = "\t" }
$1 != "AT0001A" { print FNR, NF, "error\t" $0 > "/dev/stderr" }
$2 != "00001" { ... }
$3 != "00100" { ... }
$4 !~ "(day|hour)" { ... }
$5 !~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { ... }

Of course, you know better what to expect in each field :)
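
A slightly fuller sketch of such a check, run over the unfolded output, could
look like this. The field rules are guesses based on the script in the
original post (8 tab-separated fields, a 7-character station code, two
5-digit numbers, a data type containing "day" or "hour", a YYYY-MM-DD date
and an HH:00 time) and need adjusting to the real data. The names
check_output and bad are made up for the example:

```sh
# Sketch only: sanity-check the unfolded output lines.
# check_output and bad() are hypothetical names; the rules are assumptions.
check_output() {
    awk '
    BEGIN { FS = "\t" }

    function bad(why) {
        # report line number, field count, reason and the offending line
        print FNR, NF, "error: " why "\t" $0 > "/dev/stderr"
        errors++
    }

    NF != 8          { bad("field count"); next }
    length($1) != 7  { bad("station code") }
    $2 !~ /^[0-9][0-9][0-9][0-9][0-9]$/  { bad("component number") }
    $3 !~ /^[0-9][0-9][0-9][0-9][0-9]$/  { bad("measurement number") }
    $4 !~ /(day|hour)/                   { bad("data type") }
    $5 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/  { bad("date") }
    $6 !~ /^[0-2][0-9]:00$/              { bad("time") }

    END { exit (errors > 0) }            # non-zero exit status if anything failed
    ' "$@"
}
```

Something like "check_output result.txt && echo ok" would then report
problems on stderr and only print "ok" for a clean file. The patterns check
the shape of each field only, so an impossible date such as 2000-02-31 still
passes.
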
