Fields not always fixed.
Hello all,
I am new to this group and just starting to get my hands on awk (nawk).
I am trying to process text (structure shown below) which has some fixed
fields. I know the approximate position of some fields and also the pattern
of some fields, but can't figure out what to do with the fields that
fluctuate: "Description" and "CODE1". The portion of the data that I'm
working on right now looks like this:
Item Qty Description CODE1 Price CODE2
12-3456 10 DESCRIPTION THAT SOMETIMES WILL WRAP BUT I 1-1234-5678-9 12.34
DONT CARE ABOUT THIS WRAPPED PART JUST NOW
A1234 10 THIS DESCRIOPTION LINE DID NOT WRAP ALL 1-1234-5678-9 1.23 51653
12-3456 10 DESCRIPTION THAT SOMETIMES WILL WRAP BUT I 1-1234-5678-9 12.34
DONT CARE ABOUT THIS WRAPPED PART JUST NOW
"Item" field is sort of fixed and I can get it this way:
    ITEM=substr($0,1,16);                                      # I know max. length
    while (ITEM ~ /^ /) ITEM=substr(ITEM, 2);                  # cleaning out leading
    while (ITEM ~ / $/) ITEM=substr(ITEM, 1, length(ITEM)-1);  # and trailing spaces
"Qty" - same as "Item":
    QUAN=substr($0,17,8);
    while (QUAN ~ /^ /) QUAN=substr(QUAN, 2);
    while (QUAN ~ / $/) QUAN=substr(QUAN, 1, length(QUAN)-1);
"Description" - I know the starting position but ending sometimes overlaps
with starting position of "CODE1"
"CODE1" will always look like this (13 characters, each 0-9, X, x or a dash):
[0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx]
Most of the time I know where "CODE1" starts, but look at item 73-6046:
there "CODE1" directly follows the "Description".
"Price" I know where it starts and ends, and will look like this:
[0-9][0-9]*\.[0-9][0-9]
"CODE2" is not always present but I know the starting position and it will
look like this:
[0-9a-zA-Z]*
The data is not all that bad, and most of the lines follow a good pattern
where I can get all the fields, but I'm showing some exceptions with extreme
situations.
I would appreciate help with a strategy for getting "Description" and
"CODE1" reliably despite those exceptions. I have thought about using sed
to prepare the data better but I'd like to exhaust the awk possibilities
first.
Many thanks,
David
Patrick TJ McPhee 2004-03-19, 8:23 pm
In article <6QMYb.339338$I06.3542631@attbi_s01>,
Dave <withheld@nospam.thanks> wrote:
[...]
% "Item" field is sort of fixed and I can get it this way:
% ITEM=substr($0,1,16); # I know max. length
% while (ITEM ~ /^ /) ITEM=substr(ITEM, 2); # cleaning out leading
% while (ITEM ~ / $/) ITEM=substr(ITEM, 1, length(ITEM)-1); # and trailing spaces
I would use gsub for this (the regular expression comes first, the
target variable last):
    gsub(/^ +| +$/, "", ITEM)
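As a minimal sketch of that gsub-based trim applied to the two fixed columns from the question: the wrapper name trim_fields and the helper trim() are made up for illustration, and the column offsets (1-16 for Item, 17-24 for Qty) are taken from the substr calls in the original post.

```shell
# Hypothetical wrapper: slices the Item and Qty columns and trims each
# with a single gsub call instead of the two while loops.
trim_fields() {
  awk '
    # trim() is a name invented for this sketch, not an awk built-in
    function trim(s) {
      gsub(/^ +| +$/, "", s)   # strip leading and trailing spaces
      return s
    }
    {
      item = trim(substr($0, 1, 16))   # Item: columns 1-16
      quan = trim(substr($0, 17, 8))   # Qty:  columns 17-24
      print item "|" quan
    }
  '
}

# Build one fixed-width sample line and run it through the sketch
printf '%-16s%-8s%s\n' '12-3456' '10' 'SOME DESCRIPTION' | trim_fields
# prints: 12-3456|10
```

Because s is passed by value, gsub changes only the local copy, so the caller has to use the returned string.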
% "Description" - I know the starting position but ending sometimes overlaps
% with starting position of "CODE1"
That's unfortunate. I suggest looking at $NF, $(NF-1), and $(NF-2). You
could compare $(NF-2) and $(NF-1) to the pattern for CODE1, which you
really ought to refine a bit, and use that to determine whether CODE2
is present or not. You can use index to find CODE1 and use that to get
an upper bound on DESCRIPTION.
# CODE2 not present -- note that - must come at the start or end of
# the [] or it will be treated as a range operator. This will work
# with any POSIX-compliant awk. With other awks, I suggest replacing
# {13} with + and testing for the length of $(NF-1) explicitly
$(NF-1) ~ /^[0-9Xx-]{13}$/ { code1 = $(NF-1); price = $NF; code2 = "" }
# CODE2 present
$(NF-2) ~ /^[0-9Xx-]{13}$/ { code1 = $(NF-2); price = $(NF-1); code2 = $NF }
# the third argument of substr is a length, so subtract the start offset
{ desc = substr($0, descoffset, index($0, code1) - descoffset) }
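Put together, the approach might look like the sketch below. It uses the portable variant (+ plus an explicit length test instead of {13}), so it should not depend on interval-expression support. The names parse_records, iscode1() and trim() are invented for this example, and descoffset=25 is an assumption derived from the column widths in the question (Item is 16 characters, Qty is 8).

```shell
# Hypothetical wrapper; descoffset=25 is an assumption (16 + 8 + 1).
parse_records() {
  awk -v descoffset=25 '
    # iscode1(): exactly 13 characters, each a digit, X, x or dash
    function iscode1(s) {
      return length(s) == 13 && s ~ /^[0-9Xx-]+$/
    }
    function trim(s) { gsub(/^ +| +$/, "", s); return s }

    # NF >= 3 keeps $(NF-2) from referring to a negative field number
    NF >= 3 && (iscode1($(NF-1)) || iscode1($(NF-2))) {
      if (iscode1($(NF-1))) { code1 = $(NF-1); price = $NF;     code2 = ""  }
      else                  { code1 = $(NF-2); price = $(NF-1); code2 = $NF }
      item = trim(substr($0, 1, 16))
      quan = trim(substr($0, 17, 8))
      # substr takes a length, so subtract the start offset from the index
      desc = trim(substr($0, descoffset, index($0, code1) - descoffset))
      print item "|" quan "|" desc "|" code1 "|" price "|" code2
    }
    # lines without a CODE1 (header, wrapped description text) are skipped
  '
}

# Sample record with CODE2 present and CODE1 right after the description
printf '%-16s%-8s%s %s %s %s\n' 'A1234' '10' \
  'THIS DESCRIOPTION LINE DID NOT WRAP ALL' '1-1234-5678-9' '1.23' '51653' |
  parse_records
# prints: A1234|10|THIS DESCRIOPTION LINE DID NOT WRAP ALL|1-1234-5678-9|1.23|51653
```

Skipping lines that have no CODE1 also avoids reusing a stale code1 value from the previous record on wrapped continuation lines.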
You might find it a bit faster to first test the known position of CODE1
to see if it matches the RE, then use index() as a fallback for the
exceptions.
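A rough sketch of that two-step lookup follows. The usual CODE1 start column is not given anywhere in the thread, so code1pos=69 is purely a placeholder, and find_code1 and iscode1() are names invented for this example. The output shows which path located the code, its start column, and the code itself.

```shell
# Hypothetical wrapper; code1pos=69 is a placeholder for the usual column.
find_code1() {
  awk -v code1pos=69 '
    function iscode1(s) { return length(s) == 13 && s ~ /^[0-9Xx-]+$/ }
    {
      code1 = ""; start = 0; how = "none"
      cand = substr($0, code1pos, 13)
      # fast path: 13 code characters at the usual column, with a space
      # before them and a space or end of line after them
      if (iscode1(cand) && substr($0, code1pos - 1, 1) == " " &&
          substr($0, code1pos + 13, 1) ~ /^ ?$/) {
        code1 = cand; start = code1pos; how = "fixed"
      } else if (NF >= 3) {
        # fallback: look at the trailing fields, then locate with index()
        if (iscode1($(NF-1)))      code1 = $(NF-1)
        else if (iscode1($(NF-2))) code1 = $(NF-2)
        if (code1 != "") { start = index($0, code1); how = "index" }
      }
      print how "|" start "|" code1
    }
  '
}

# CODE1 at the assumed column: handled by the fast path
printf '%-16s%-8s%-44s%s %s\n' '12-3456' '10' \
  'DESCRIPTION THAT SOMETIMES WILL WRAP BUT I' '1-1234-5678-9' '12.34' |
  find_code1
# prints: fixed|69|1-1234-5678-9
```

A line whose description runs straight into CODE1 fails the fixed-column test and is picked up by the index() fallback instead.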
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com