The missing years

Hermann Peifer

2007-02-18, 6:56 pm

Hi All,

I have records with the following field logic:
station_code start_year end_year year year year ... year

0884A 1996 2005 1996 1997 2001 2002 2003 1998 2000 2004 2005
0885A 2001 2005 2002 2003 2001 2005
0886A 1996 2003 1996 1997 2000 2001 1999 2003

I'd like to know which years between start_year and end_year are NOT
listed in the following fields. The expected output for the above
example would be:

0884A 1999
0885A 2004
0886A 1998
0886A 2002

I thought of constructing an array with all years between start_year
and end_year, then deleting individual array elements based on the years
in the record. At the end I would scan the array and see what is left.
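
The plan described above (fill an array with every year in the range, delete the listed ones, report what is left) could be sketched roughly as follows. The function name missing_years_delete and the array name left are made up for illustration, and the sketch assumes whitespace-separated fields with integer years.

```shell
#!/bin/sh
# missing_years_delete: hypothetical helper name; reads records on stdin.
missing_years_delete() {
  awk '
  {
    # Start each record with an empty array (portable way to clear it).
    split("", left)
    # 1. Mark every year from start_year ($2) to end_year ($3).
    for (y = $2 + 0; y <= $3 + 0; y++) left[y] = 1
    # 2. Delete each year that is listed in fields 4..NF.
    for (i = 4; i <= NF; i++) delete left[$i + 0]
    # 3. Whatever is left is missing. Walk the range again so the
    #    output is ascending (for-in order is unspecified in awk).
    for (y = $2 + 0; y <= $3 + 0; y++)
      if (y in left) print $1, y
  }'
}

printf '%s\n' \
  '0884A 1996 2005 1996 1997 2001 2002 2003 1998 2000 2004 2005' \
  '0885A 2001 2005 2002 2003 2001 2005' \
  '0886A 1996 2003 1996 1997 2000 2001 1999 2003' |
  missing_years_delete
```

Run on the three sample records, this should print the four lines of expected output listed above.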

Any help is welcome.

TIA. Hermann

Cesar Rabak

2007-02-18, 6:56 pm

Hermann Peifer wrote:
> Hi All,
>
> I have records with the following field logic:
> station_code start_year end_year year year year ... year
>
> 0884A 1996 2005 1996 1997 2001 2002 2003 1998 2000 2004 2005
> 0885A 2001 2005 2002 2003 2001 2005
> 0886A 1996 2003 1996 1997 2000 2001 1999 2003
>
> I'd like to know which years between start_year and end_year are NOT
> listed in the following fields. The expected output for the above
> example would be:
>
> 0884A 1999
> 0885A 2004
> 0886A 1998
> 0886A 2002
>
> I thought of constructing an array with all years between start_year and
> end_year, then deleting individual array elements based on the years in
> the record. At the end I would scan the array and see what is left.
>
> Any help is welcome.
>

Hi Hermann,

I would refine your analysis as follows:

For each line of data, I would build a two-dimensional array indexed by
NR and year, and initialize it to zero for every year from start_year
to end_year.

Then I would read the rest of the line ($4 to $NF), adding one to the
array element indexed by the contents of each field (the year).

In the END block, you iterate over the record numbers (on second
thought: better still to index by station instead of record number!)
and look for elements that are still zero. Those are the missing years.

This approach may not be appropriate if the file to be processed is too
big, in which case you may think of putting all this machinery in the
line processing logic.
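
A minimal sketch of the END-block approach described above, indexing by station rather than record number as the second thought suggests. The names missing_years_end, seen, order, first and last are invented for illustration, and the sketch assumes that each station code appears on exactly one input line and that years are integers.

```shell
#!/bin/sh
# missing_years_end: hypothetical helper name; reads records on stdin.
missing_years_end() {
  awk '
  {
    # Remember each station once, in input order, with its year range.
    if (!($1 in first)) {
      order[++n] = $1
      first[$1] = $2 + 0
      last[$1]  = $3 + 0
    }
    # Initialize the counter for every year in the range to zero.
    for (y = $2 + 0; y <= $3 + 0; y++) seen[$1, y] += 0
    # Add one for every year listed in fields 4..NF.
    for (i = 4; i <= NF; i++) seen[$1, $i + 0]++
  }
  END {
    # Scan station by station; a counter still at zero is a missing year.
    for (s = 1; s <= n; s++) {
      st = order[s]
      for (y = first[st]; y <= last[st]; y++)
        if (seen[st, y] == 0) print st, y
    }
  }'
}
```

Because every record is held in memory until END, this trades memory for a single reporting pass, which is the size concern raised above.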

HTH

--
Cesar Rabak

Janis Papanagnou

2007-02-18, 6:56 pm

Hermann Peifer wrote:
> Hi All,
>
> I have records with the following field logic:
> station_code start_year end_year year year year ... year
>
> 0884A 1996 2005 1996 1997 2001 2002 2003 1998 2000 2004 2005
> 0885A 2001 2005 2002 2003 2001 2005
> 0886A 1996 2003 1996 1997 2000 2001 1999 2003
>
> I'd like to know which years between start_year and end_year are NOT
> listed in the following fields. The expected output for the above
> example would be:
>
> 0884A 1999
> 0885A 2004
> 0886A 1998
> 0886A 2002


awk '
{ for (nf=4;nf<=NF;nf++) ya[$nf]
  for (y=$2;y<=$3;y++)
    if(!(y in ya)) print $1,y
  delete ya
}'
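
Deleting a whole array with a bare delete, as the script above does, is widely supported but was long an extension, and some older awk implementations reject it. A hedged variant that clears the array with split("", ya) instead is sketched below; the wrapper name missing_years_portable is made up for illustration, and the logic is otherwise the same as above.

```shell
#!/bin/sh
# missing_years_portable: hypothetical wrapper name.
# Reads the named files, or stdin when no file is given.
missing_years_portable() {
  awk '
  {
    split("", ya)                          # empty the array portably
    for (nf = 4; nf <= NF; nf++) ya[$nf]   # a bare reference creates the key
    for (y = $2; y <= $3; y++)
      if (!(y in ya)) print $1, y
  }' "$@"
}
```

It could be called as `missing_years_portable stations.txt`, where stations.txt is a placeholder file name holding records in the format shown at the top of the thread.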


Janis


Anton Treuenfels

2007-02-18, 6:56 pm


"Janis Papanagnou" <Janis_Papanagnou@hotmail.com> wrote in message
news:eraifd$1p3$1@online.de...

> awk '
> { for (nf=4;nf<=NF;nf++) ya[$nf]
>   for (y=$2;y<=$3;y++)
>     if(!(y in ya)) print $1,y
>   delete ya
> }'


Yeah, well, that's effective and all, but I can do it slower!

{
    for ( y = $2; y <= $3; y++ ) {
        for ( yr = 4; yr <= NF; yr++ ) {
            if ( $yr == y )
                break
        }
        if ( yr > NF )
            print $1, y
    }
}

- Anton Treuenfels


Hermann Peifer

2007-02-19, 3:56 am

Anton Treuenfels wrote:
> "Janis Papanagnou" <Janis_Papanagnou@hotmail.com> wrote in message
> news:eraifd$1p3$1@online.de...
>
>
> Yeah, well, that's effective and all, but I can do it slower!
>
> {
>     for ( y = $2; y <= $3; y++ ) {
>         for ( yr = 4; yr <= NF; yr++ ) {
>             if ( $yr == y )
>                 break
>         }
>         if ( yr > NF )
>             print $1, y
>     }
> }
>
> - Anton Treuenfels
>


Thanks to both of you.

I am sure both options will work. It is a bit early now. I will try
them out later today.

Hermann

Hermann Peifer

2007-02-19, 3:56 am

Cesar Rabak wrote:
> Hermann Peifer wrote:
> Hi Hermann,
>
> I would refine your analysis as follows:
>
> For each line of data, I would build a two-dimensional array indexed by
> NR and year, and initialize it to zero for every year from start_year
> to end_year.
>
> Then I would read the rest of the line ($4 to $NF), adding one to the
> array element indexed by the contents of each field (the year).
>
> In the END block, you iterate over the record numbers (on second
> thought: better still to index by station instead of record number!)
> and look for elements that are still zero. Those are the missing years.
>
> This approach may not be appropriate if the file to be processed is too
> big, in which case you may think of putting all this machinery in the
> line processing logic.
>
> HTH


Thanks. I am sure this will work. Currently, I only have a smaller file
to process (50000 records). I guess this is not too big for the proposed
approach.

In case of bigger files, I could indeed put "all this machinery in the
line processing logic", as suggested by Janis and Anton.

Hermann