multi-dimensional arrays
| confuzzled 2006-01-10, 3:58 am |
| What I want to do is simple in concept, but is turning out to be quite
the bear.
I have multiple blocks of data all in one file, each representing a
unique entity and/or a record of data. The data is formatted something
like this:
Name: blah blah
Address: street address
city,state,zip
Date: blah/blah/blah
License Status: blah blah
etc.
The above block represents one entity or record as already mentioned.
Each block has a varying number of lines. There is
generally a minimum of 7 lines of data per block, but there could be
more. Each block is separated by a blank line so as to be able to tell
where each block begins and ends.
Depending on the data in License Status, I either want to print the
whole block or simply discard it. Actually, I want to eliminate the
field name and just print the data. So instead of
"Name: first,last" it prints "first,last", etc. (I tried pre-parsing
the file using cut & grep, but that doesn't work so well for the simple
reason that the data in License Status sometimes shows up elsewhere in
the record (with a different meaning), so grepping for the data doesn't
help - I get false positives.)
So, I've decided that the best thing to do is create an array to hold
the whole block, while I process each line until I get to License
Status and determine its state.
If anyone has a better idea, I'm open to suggestions (my knowledge of
Perl is limited at best and will be a huge learning curve, so I'd
prefer to use awk if possible). However, assuming we go with the
method I've decided on using, I'm stuck as follows.
First, I split each line into a two-part array: the data before the
colon and the data after the colon. Note that not all lines contain
colons, in particular the address, which spans two lines.
So, I use n=split($0, array, ":"); That works just fine. I print out
array[n] which gives me exactly what I want on a line by line basis.
However, at this point I realized that I really need a two dimensional
array, so as to hold the entire block of data, not just a single line.
And that's where I get stuck.
I tried:
j=0; done=0;
while (!done)
{
(1) n=split($0, array[j], ":");
printf("%s", array[j,n]);
j++;
next;
if ($0 == "") done=1;
}
I also tried n=split($0, array[j,0], ":"); on the line labeled (1)
above. Neither works and I don't know what I'm doing wrong. Any help?
It probably doesn't matter, but just in case it helps: the final
result, when all is said and done, is a file of comma-separated, quoted
fields and records, which I then import into a spreadsheet, i.e., each
record/line consists of: "first,last", "address", "date", ...
If worst comes to worst, I can use what I've got and simply sort &
delete the records I don't want out of the spreadsheet (which is
actually what I'm doing now). But, I'd prefer not to import useless
records in the first place if at all possible.
Email response preferred as I don't generally read this group. I don't
program on a regular basis anymore, I just need to get this one thing
done to get on with my real job. I've done a great deal of work on
this, and this is the final piece to finish it off.
Thank you in advance.
| |
| William James 2006-01-10, 3:58 am |
| confuzzled wrote:
> I have multiple blocks of data all in one file, each representing a
> unique entity and/or a record of data. The data is formatted something
> like this:
>
> Name: blah blah
> Address: street address
> city,state,zip
> Date: blah/blah/blah
> License Status: blah blah
> etc.
>
> The above block represents one entity or record as already mentioned.
> Each block has a varying number of lines. There is
> generally a minimum of 7 lines of data per block, but there could be
> more. Each block is separated by a blank line so as to be able to tell
> where each block begins and ends.
So Awk must be told that a record is terminated not by
a newline character but by an empty line:
RS = ""
If the blank lines are not empty, i.e., if they may contain spaces
or tabs, you could use Mawk or Gawk and say
RS = "\n([ \t]*\n)+"
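To see what paragraph mode does, here is a small sketch. The sample data is made up for illustration (it is not from the original file), and a POSIX awk is assumed. It only reports how many lines each record holds:

```sh
# Hypothetical sample data: two records separated by a blank line.
data='Name: Ann Lee
License Status: active

Name: Bob Ray
Date: 01/02/2006
License Status: expired'

# RS = "" selects paragraph mode; FS = "\n" makes each line one field.
out=$(printf '%s\n' "$data" |
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " NF " lines, first is [" $1 "]" }')
printf '%s\n' "$out"
```

Each blank-line-separated block arrives as one record, so NF is the number of lines in that block and $1 is its first line.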
>
> Depending on the data in License Status, I either want to print the
> whole block or simply discard it. Actually, I want to eliminate the
> field name and just print the data. So instead of
> "Name: first,last" it prints "first,last", etc. (I tried pre-parsing
> the file using cut & grep, but that doesn't work so well for the simple
> reason that the data in License Status sometimes shows up elsewhere in
> the record (with a different meaning), so grepping for the data doesn't
> help - I get false positives.)
>
> It probably doesn't matter, but just in case it helps - the final
> result, when all is said and done, is a file of comma-separated, quoted
> fields and records, which I then import into a spreadsheet, i.e., each
> record/line consists of: "first,last", "address", "date", ...
> Email response preferred as I don't generally read this group.
Even though you generally don't read this group, surely you
will realize that you should read this group when you have
a question posted in this group.
BEGIN {
    # The record-separator will be an empty line.
    RS = ""
    # The field-separator will be a newline.
    FS = "\n"
}
/\nLicense Status: +expired/ {
    sep = ""
    for (i = 1; i <= NF; i++) {
        n = split($i, array, ": +")
        printf "%s\"%s\"", sep, array[n]
        sep = ", "
    }
    print ""
}
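On the array question itself: POSIX awk arrays are one-dimensional. A reference such as array[j,n] is a single subscript built by joining j and n with SUBSEP, and split() needs an array name as its second argument, which is why split($0, array[j], ":") is rejected. Also, next abandons the current record immediately, so the blank-line test after it in the original loop is never reached. A minimal sketch of the buffering idea follows, assuming made-up sample data and hypothetical names (emit, val, rec, keep) that are not from the original post:

```sh
# Hypothetical sample data in the layout described in the question.
data='Name: Ann Lee
Address: 1 Main St
Springfield,IL,62701
License Status: active

Name: Bob Ray
Address: 9 Elm Rd
Dayton,OH,45402
License Status: expired'

out=$(printf '%s\n' "$data" | awk '
BEGIN { rec = 1 }
# Print the buffered block if its License Status matched, then reset.
function emit(    k) {
    if (keep)
        for (k = 1; k <= n; k++)
            print val[rec, k]
    n = 0; keep = 0; rec++
}
/^[ \t]*$/ { emit(); next }          # a blank line ends a block
{
    line = $0
    if (line ~ /^License Status: +expired/) keep = 1
    sub(/^[^:]*: +/, "", line)       # drop "Field name: " when present
    val[rec, ++n] = line             # the key is really rec SUBSEP n
}
END { emit() }                       # the last block has no trailing blank line
')
printf '%s\n' "$out"
```

This reads the file line by line with the default RS, so it is only an alternative to the paragraph-mode program above, not a replacement for it.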
| |
| Ed Morton 2006-01-10, 3:58 am |
| confuzzled wrote:
> What I want to do is simple in concept, but is turning out to be quite
> the bear.
>
> I have multiple blocks of data all in one file, each representing a
> unique entity and/or a record of data. The data is formatted something
> like this:
>
> Name: blah blah
> Address: street address
> city,state,zip
> Date: blah/blah/blah
> License Status: blah blah
> etc.
>
> The above block represents one entity or record as already mentioned.
> Each block has a varying number of lines. There is
> generally a minimum of 7 lines of data per block, but there could be
> more. Each block is separated by a blank line so as to be able to tell
> where each block begins and ends.
>
> Depending on the data in License Status, I either want to print the
> whole block or simply discard it. Actually, I want to eliminate the
> field name and just print the data. So instead of
> "Name: first,last" it prints "first,last", etc.
Try this:
$ awk -v RS= '/(^|\n)License Status: blah blah\n/ &&
gsub(/\n[^:]*:/,"\n") && sub(/^[^:]*: /,"")' file
or this if you have gawk:
$ awk -v RS= '/(^|\n)License Status: blah blah\n/{
print gensub(/(^|\n)[^:]*: /,"\\1","g")}' file
replacing "blah blah" with whatever value you want to find.
Ed.
| |
| Ed Morton 2006-01-10, 3:58 am |
| Ed Morton wrote:
> confuzzled wrote:
>
>
>
> Try this:
>
> $ awk -v RS= '/(^|\n)License Status: blah blah\n/ &&
> gsub(/\n[^:]*:/,"\n") && sub(/^[^:]*: /,"")' file
>
> or this if you have gawk:
>
> $ awk -v RS= '/(^|\n)License Status: blah blah\n/{
> print gensub(/(^|\n)[^:]*: /,"\\1","g")}' file
>
> replacing "blah blah" with whatever value you want to find.
>
Hmm, I missed a "|$" at the end of the License Status REs (to
accommodate that being the last line of each record):
$ awk -v RS= '/(^|\n)License Status: blah blah(\n|$)/ &&
gsub(/\n[^:]*: /,"\n") && sub(/^[^:]*: /,"")' file
or this if you have gawk:
$ awk -v RS= '/(^|\n)License Status: blah blah(\n|$)/{
print gensub(/(^|\n)[^:]*: /,"\\1","g")}' file
and I also just noticed this additional requirement in your original
posting:
> It probably doesn't matter, but just in case it helps - the final
> result, when all is said and done, is a file of comma-separated, quoted
> fields and records, which I then import into a spreadsheet, i.e., each
> record/line consists of: "first,last", "address", "date", ...
so just tweak the first of the above to:
$ awk -v RS= '/(^|\n)License Status: blah blah(\n|$)/{
gsub(/\n[^:]*: /,"\",\"");sub(/^[^:]*: /,"\"");$0=$0"\"";print}' file
Ed.
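One caveat with the regexps above: [^:]* can also match newlines, so a line without a colon (such as city,state,zip) may be swallowed together with the field name that follows it. A possible variant, sketched here with made-up sample data and hypothetical variable names (status, val, nf, keep), walks the lines of each record and appends colon-less lines to the previous field:

```sh
# Hypothetical sample data in the layout described in the question.
data='Name: Ann Lee
Address: 1 Main St
Springfield,IL,62701
Date: 03/04/2005
License Status: active

Name: Bob Ray
Address: 9 Elm Rd
Dayton,OH,45402
Date: 01/02/2006
License Status: expired'

out=$(printf '%s\n' "$data" | awk -v status="expired" '
BEGIN { RS = ""; FS = "\n" }         # one block per record, one line per field
{
    keep = 0; nf = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[^:]+: /) {                     # a "Field name: value" line
            if ($i == ("License Status: " status)) keep = 1
            val[++nf] = substr($i, index($i, ": ") + 2)
        } else if (nf > 0) {                       # continuation, e.g. city,state,zip
            val[nf] = val[nf] ", " $i
        }
    }
    if (keep) {
        line = ""
        for (i = 1; i <= nf; i++)
            line = line (i > 1 ? "," : "") "\"" val[i] "\""
        print line
    }
}')
printf '%s\n' "$out"
```

Comparing the whole License Status line against the wanted value avoids the false positives mentioned in the question, since the same text elsewhere in the record is ignored. Values that themselves contain double quotes would still need escaping before a spreadsheet import.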