Code Comments

Fields not always fixed.
Hello all,

I am new to this group and just starting to get my hands on awk (nawk).
I am trying to process text (structure shown below) which has some fixed
fields. I know the approximate position of some fields and the pattern of
others, but I can't figure out what to do with the fields that fluctuate:
"Description" and "CODE1". The portion of the data that I'm working on right
now looks like this:

Item           Qty                         Description
CODE1        Price    CODE2
12-3456             10    DESCRIPTION THAT SOMETIMES WILL WRAP BUT I
1-1234-5678-9       12.34
DONT CARE ABOUT THIS WRAPPED PART JUST NOW
A1234               10    THIS DESCRIOPTION LINE DID NOT WRAP ALL
1-1234-5678-9       1.23   51653
12-3456             10    DESCRIPTION THAT SOMETIMES WILL WRAP BUT I
1-1234-5678-9                 12.34
DONT CARE ABOUT THIS WRAPPED PART JUST NOW

"Item" field is sort of fixed and I can get it this way:
ITEM=substr($0,1,16);                                      # I know max. length
while (ITEM ~ /^ /) ITEM=substr(ITEM, 2);                  # cleaning out leading
while (ITEM ~ / $/) ITEM=substr(ITEM, 1, length(ITEM)-1);  # and trailing spaces

"Qty" - same as "Item":
QUAN=substr($0,17,8);
while (QUAN ~ /^ /) QUAN=substr(QUAN, 2);
while (QUAN ~ / $/) QUAN=substr(QUAN, 1, length(QUAN)-1);

"Description" - I know the starting position but ending sometimes overlaps
with starting position of "CODE1"

"CODE1" will always look like this (13 characters with 0-9 or a dash):
[0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx][0-9-Xx]
Most of the time I know where "CODE1" starts, but look at item 73-6046,
where "CODE1" simply follows the "Description".

"Price" I know where it starts and ends, and will look like this:
[0-9][0-9]*\.[0-9][0-9]

"CODE2" is not always present but I know the starting position and it will
look like this:
[0-9a-zA-Z]*

The data is not all that bad, and most of the lines follow a good pattern
where I can get all the fields, but I'm showing some exceptions with extreme
situations.
I would appreciate help with a strategy for getting "Description" and
"CODE1" reliably despite those exceptions. I have thought about using sed
to prepare the data better but I'd like to exhaust the awk possibilities
first.

Many thanks,

David



Dave
03-20-04 01:23 AM


Re: Fields not always fixed.
In article <6QMYb.339338$I06.3542631@attbi_s01>,
Dave <withheld@nospam.thanks> wrote:

[...]

% "Item" field is sort of fixed and I can get it this way:
%     ITEM=substr($0,1,16);  # I know max. length
%     while (ITEM ~ /^ /) ITEM=substr(ITEM, 2);  # cleaning out leading
%     while (ITEM ~ / $/) ITEM=substr(ITEM, 1, length(ITEM)-1);  # and trailing spaces

I would gsub for this:
gsub(/^ +| +$/, "", ITEM)
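
As a rough illustration of the gsub() approach, here is a minimal sketch that wraps the trimming in a helper. gsub() takes the regular expression first, then the replacement, then the target. The names trim_field and trim are made up for this example, and the 16-column width is taken from the original post.

```shell
# Strip leading and trailing blanks from the first 16 columns of each line.
# trim_field and trim are hypothetical names, not part of awk itself.
trim_field() {
    awk '
    function trim(s) {
        gsub(/^ +| +$/, "", s)   # gsub(regex, replacement, target)
        return s
    }
    { print "[" trim(substr($0, 1, 16)) "]" }
    '
}

# Usage: the brackets make any leftover blanks visible.
printf '%s\n' '12-3456             10    SOME TEXT' | trim_field
```

Because s is a scalar parameter, gsub() only changes the function's local copy, so the helper can be reused for ITEM, QUAN and the other fields without side effects.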

% "Description" - I know the starting position but ending sometimes overlaps
% with starting position of "CODE1"

That's unfortunate. I suggest looking at $NF, $(NF-1), and $(NF-2). You
could compare $(NF-2) and $(NF-1) to the pattern for CODE1, which you
really ought to refine a bit, and use that to determine whether CODE2
is present or not. You can use index to find CODE1 and use that to get
an upper bound on DESCRIPTION.

# code 2 not present -- note that - must come at the start or end of
# the [] or it will be treated as a range operator. This will work
# with any POSIX-compliant awk. With other awks, I suggest replacing
# {13} with + and testing for the length of $(NF-1) explicitly
$(NF-1) ~ /^[0-9Xx-]{13}$/ { code1 = $(NF-1); price = $NF; code2 = "" }
# code 2 present
$(NF-2) ~ /^[0-9Xx-]{13}$/ { code1 = $(NF-2); price = $(NF-1); code2 = $NF }
# the third argument of substr() is a length, not an end position
{ desc = substr($0, descoffset, index($0, code1) - descoffset) }
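
To make the idea concrete, here is a hedged, self-contained sketch of the same strategy. It uses the portable variant (+ plus an explicit length test instead of {13}). The column layout is an assumption made only for this demo: Item in columns 1-16, Qty in 17-24, Description starting at column 25. The names parse_records, iscode1 and trim are hypothetical.

```shell
# Print item|qty|description|code1|price|code2 for every line that carries
# a CODE1 in the last-but-one or last-but-two field; skip everything else.
parse_records() {
    awk -v descoffset=25 '
    function iscode1(s) { return length(s) == 13 && s ~ /^[0-9Xx-]+$/ }
    function trim(s)    { gsub(/^ +| +$/, "", s); return s }
    NF < 3 { next }                       # too short to hold CODE1 and Price
    {
        if (iscode1($(NF-1)))      { code1 = $(NF-1); price = $NF;     code2 = "" }
        else if (iscode1($(NF-2))) { code1 = $(NF-2); price = $(NF-1); code2 = $NF }
        else next                         # header or wrapped description line
        pos  = index($0, code1)           # CODE1 bounds the description
        desc = trim(substr($0, descoffset, pos - descoffset))
        print trim(substr($0, 1, 16)) "|" trim(substr($0, 17, 8)) "|" desc "|" code1 "|" price "|" code2
    }
    '
}

# Demo input built with printf so the assumed column positions are explicit.
{
    printf '%-16s%-8s%-44s%-14s%-7s%s\n' A1234 10 'THIS LINE DID NOT WRAP' 1-1234-5678-9 1.23 51653
    printf '%-16s%-8s%s %s %s\n' 12-3456 10 'A LONG DESCRIPTION THAT RUNS PAST THE USUAL CODE1 COLUMN' 1-1234-5678-9 12.34
    printf '%s\n' 'DONT CARE ABOUT THIS WRAPPED PART JUST NOW'
} | parse_records
```

The second demo line has its CODE1 pushed to the right by a long description, and the third is a wrapped continuation line that is skipped. Note that index() finds the first occurrence of the CODE1 string, so this would misfire if the same 13 characters also appeared inside the description.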

You might find it a bit faster to first test the known position of CODE1
to see if it matches the RE, then use index() as a fallback for the
exceptions.
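
A possible shape for that two-step lookup, as a sketch only: check the usual column first, and fall back to the field tests plus index() when it does not hold a valid CODE1. The usual start column (69 here) is an assumed value for the demo, and locate_code1 and iscode1 are hypothetical names.

```shell
# Print the 1-based column where CODE1 starts on each line, or 0 if none.
locate_code1() {
    awk -v codepos=69 '
    function iscode1(s) { return length(s) == 13 && s ~ /^[0-9Xx-]+$/ }
    {
        # Fast path: CODE1 sits at its usual column.
        if (iscode1(substr($0, codepos, 13))) { print codepos; next }
        # Fallback: test $(NF-1) and $(NF-2), then locate the match.
        for (i = NF - 1; i >= NF - 2 && i >= 1; i--)
            if (iscode1($i)) { print index($0, $i); next }
        print 0
    }
    '
}

# Usage: the first record has CODE1 at column 69, the second at column 71.
{
    printf '%-68s%s %s\n' 'A1234 10 USUAL RECORD' 1-1234-5678-9 1.23
    printf '%-70s%s %s\n' '12-3456 10 SHIFTED RECORD' 1-1234-5678-9 12.34
} | locate_code1
```

Whether the fast path actually saves time depends on the awk in use and on how many lines are exceptions, so it is worth timing on real data.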

--

Patrick TJ McPhee
East York  Canada
ptjm@interlog.com

Patrick TJ McPhee
03-20-04 01:23 AM

