A new project with awk
| Greg Michael 2006-10-30, 7:01 pm |
| Hi again. I want to thank Ed Morton and Ted Davis for their assistance with
my last request.
I am working on a new project and am running into both tool limitations and
gaps in my own knowledge. We are running HP-UX 11i, and unfortunately I don't
know which version of awk I'm using, or how to find out.
I have a CSV file with a dump of records from a table in one of our
databases. In this CSV file, there is a specific column that I need to take
the data from, using an ERE to pick out the fields that are actually useful.
Once I have pulled these records out, I need to loop through such that each
individual record is searched for in archived files (just archived, not
compressed, tarred, zipped, or otherwise made smaller.) When the field is
not found, I need to output a line to a file. When the field is found in a
file, I can skip it.
The first problem that I was finding was that awk is complaining that the
first line is too long:
awk: Input line AUDIT_NUMBER,AUDIT_D cannot be longer than 3,000 bytes
The first line of the CSV file is the header information for each column,
and looks like this:
AUDIT_NUMBER,AUDIT_DATE,AUDIT_TYPE,AUDIT_USER,PRD_LVL_CHILD,PRD_LVL_PARENT,PRD_LVL_ID,PRD_NAME_FULL,PRD_TARGETGM,PRD_LVL_NUMBER,PRD_LVL_ACTIVE,PRD_STYLE_IND,PRD_STATUS,PRD_ACT_VAL,PRD_INH_VAL,PRD_UOM_SIZE,PRD_SLL_UOM,PRD_COMP_UOM,PRD_CONV_QTY,PRD_CROSS_DOCK,PRD_LVL_NUMBER_OLD,PRD_LVL_PARENT_OLD
The field that I'm most interested in is PRD_LVL_NUMBER, which is the 10th
field. The data in this field is somewhat unstructured, but what I'm looking
for is a value that is either an 8-digit number or a 7-digit number with a
hyphen ( - ) between digits 5 and 6 (12345678 or 12345-67).
When I find records that match the above criteria, I need to take the value
in that 10th field, and grep for it in several hundred archived files. When
the number is found, it can be ignored, but when the number is not found in
any of those files, I need to output that information, whether to a file or
the screen isn't important right now.
Thank you all in advance for whatever help you can provide.
Greg
| |
| Ed Morton 2006-10-30, 7:01 pm |
| Greg Michael wrote:
> The field that I'm most interested in is the PRD_LVL_NUMBER field, which is
> the 10th field. The data in this field is somewhat unstructured, but, what
> I'm looking for is a field that is either an 8-digit numeric number, or a
> 7-digit numeric number with a hyphen ( - ) between characters 5 and 6
> (12345678 or 12345-67).
awk -F, '$10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/ { print $10 }' file
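Note that the pattern above is unanchored, so it also accepts longer values that merely contain such a run of characters (a 9-digit number, for instance). If only exact 8-character values are wanted, an anchored variant might look like the sketch below. The sample rows are invented for illustration; only the position of the 10th field comes from the post.

```shell
# Hypothetical sample rows: nine filler fields, then the 10th field under test.
# ^ and $ force the whole field to match, not just part of it.
matches=$(awk -F, '$10 ~ /^[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]$/ { print $10 }' <<'EOF'
a,b,c,d,e,f,g,h,i,12345678,x
a,b,c,d,e,f,g,h,i,12345-67,x
a,b,c,d,e,f,g,h,i,123456789,x
a,b,c,d,e,f,g,h,i,ABC12345,x
EOF
)
printf '%s\n' "$matches"
```

With these rows, only 12345678 and 12345-67 are printed; the 9-digit value and the alphanumeric one are rejected.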
> When I find records that match the above criteria, I need to take the value
> in that 10th field, and grep for it in several hundred archived files. When
> the number is found, it can be ignored, but when the number is not found in
> any of those files, I need to output that information, whether to a file or
> the screen isn't important right now.
awk -F, '
NR == FNR {
    if ($10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/)
        pats[$10]
    next
}
{
    for (pat in pats)
        if ($0 ~ pat)
            delete pats[pat]
}
END {
    for (pat in pats)
        printf "%s is not in any file\n",pat
}' file /archives/*
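For readers unfamiliar with the idiom: NR == FNR is true only while awk is reading the first file named on the command line, so the first block collects candidate values from the CSV and the second block strikes out each value as soon as it turns up in an archive line. The sketch below runs the same script on a tiny invented data set. The scratch directory, file names and contents are all assumptions for the demo, as is the availability of mktemp -d.

```shell
# Hypothetical demo data in a scratch directory.
tmp=$(mktemp -d)
mkdir "$tmp/archives"
printf 'a,b,c,d,e,f,g,h,i,12345678,x\na,b,c,d,e,f,g,h,i,12345-67,x\n' > "$tmp/file"
printf 'order 12345678 shipped\n' > "$tmp/archives/one.txt"
printf 'nothing relevant here\n' > "$tmp/archives/two.txt"

missing=$(awk -F, '
NR == FNR {                      # first file only: remember matching 10th fields
    if ($10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/)
        pats[$10]
    next
}
{                                # archive lines: drop every value that appears
    for (pat in pats)
        if ($0 ~ pat)
            delete pats[pat]
}
END {                            # whatever is left was never seen
    for (pat in pats)
        printf "%s is not in any file\n", pat
}' "$tmp/file" "$tmp/archives"/*)
printf '%s\n' "$missing"
rm -rf "$tmp"
```

Here 12345678 occurs in one.txt, so only 12345-67 is reported as missing.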
> Thank you all in advance for whatever help you can provide.
>
> Greg
If you're still having problems:
a) what version of awk are you using (awk --version)?
b) what are you using for a record separator (RS) and does it appear at
the end of each line in your file?
c) what are you using for a field separator (FS), and does it only
appear between fields in your file?
d) Show a couple of lines of input plus expected output.
Ed.
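Question (b) matters because awk's default record separator is a newline. If an export used carriage returns alone as line endings, awk would see the whole file as one very long record; that is one possible, unconfirmed, explanation for a "too long" error on a header of only a few hundred bytes. A quick way to check is to count both characters, sketched here on an invented sample file (the file contents and the use of mktemp are assumptions for the demo):

```shell
# Hypothetical file with CR-only line endings.
f=$(mktemp)
printf 'H1,H2\r1,2\r3,4\r' > "$f"

# Delete everything except the character of interest, then count what is left.
newlines=$(tr -cd '\n' < "$f" | wc -c | tr -d ' ')
returns=$(tr -cd '\r' < "$f" | wc -c | tr -d ' ')
echo "newlines=$newlines carriage-returns=$returns"

# If only carriage returns are present, converting them gives awk normal lines.
tr '\r' '\n' < "$f" > "$f.fixed"
fixed_lines=$(wc -l < "$f.fixed" | tr -d ' ')
echo "lines after conversion: $fixed_lines"
rm -f "$f" "$f.fixed"
```

A newline count of zero (or far below the expected number of records) would point at the record separator rather than at the data itself.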
| |
| Greg Michael 2006-10-30, 7:01 pm |
| "Ed Morton" <morton@lsupcaemnt.com> wrote in message
news:PtmdnZW78_cis9zYnZ2dnUVZ_sGdnZ2d@comcast.com...
> awk -F, '$10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/{print $10}'
> file
What was causing the "cannot be longer than 3,000 bytes" error I was
getting?
> awk -F, '
> NR == FNR {
>     if ($10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/)
>         pats[$10]
>     next
> }
> {
>     for (pat in pats)
>         if ($0 ~ pat)
>             delete pats[pat]
> }
> END {
>     for (pat in pats)
>         printf "%s is not in any file\n",pat
> }' file /archives/*
This appears to be working, but... I have 730 lines in the CSV file that
match that pattern, combined with 2306 archive files to search through...
needless to say, it's taking a very long time. Any ideas to speed up the
process?
| |
| Kenny McCormack 2006-10-30, 7:01 pm |
| In article <5PCdncWOvLUjrd_YnZ2dnUVZ_sednZ2d@comcast.com>,
Greg Michael <gmichae@comcast.net> wrote:
>What was causing my problem with the "cannot be longer than 3000 bytes"
>problem?
I believe the native HP-UX awk had that (weird) limitation.
As others have hinted: Get GAWK, you'll be glad you did.
| |
| Ed Morton 2006-10-30, 7:01 pm |
| Greg Michael wrote:
> What was causing my problem with the "cannot be longer than 3000 bytes"
> problem?
It could be related to the answer to any of the questions I asked in my
previous post, or to something else.
> This appears to be working, but... I have 730 lines in the CSV file that
> match that pattern, combined with 2306 archive files to search through...
> needless to say, it's taking a very long time. Any ideas to speed up the
> process?
Not in awk. Your OS may have faster commands to GREP (hint) a pattern
from a bunch of files. If so, you could do something like this very
UNIX-like "pseudo-code":
awk -F, '$10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/ { print $10 }' file |
while read pat
do
    grep "$pat" /archives/* >/dev/null ||
        printf "%s is not in any file\n" "$pat"
done
Regards,
If you don't know how to do this in your OS, post a followup to an
OS-specific group, e.g. comp.unix.shell if it's UNIX.
Ed.
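A possible refinement of the loop above, offered as a sketch rather than a tested replacement: grep -q exits at the first match instead of scanning every remaining file, and grep -F treats each value as a fixed string rather than a regular expression. Both options are in POSIX grep, but whether the HP-UX grep in question supports them is an assumption to verify. The scratch directory and data below are invented for the demo.

```shell
# Hypothetical demo data (mktemp -d is assumed to be available).
tmp=$(mktemp -d)
mkdir "$tmp/archives"
printf 'a,b,c,d,e,f,g,h,i,12345678,x\na,b,c,d,e,f,g,h,i,12345-67,x\n' > "$tmp/file"
printf 'order 12345678 shipped\n' > "$tmp/archives/one.txt"

report=$(awk -F, '$10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/ { print $10 }' "$tmp/file" |
while read -r pat
do
    # -q: stop at the first hit; -F: literal string; -e: mark the pattern explicitly
    grep -F -q -e "$pat" "$tmp/archives"/* ||
        printf '%s is not in any file\n' "$pat"
done)
printf '%s\n' "$report"
rm -rf "$tmp"
```

The saving from -q depends on how early in the file list a match occurs; values that are genuinely missing still require a full scan.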
| |
| Greg Michael 2006-10-30, 7:01 pm |
| "Ed Morton" <morton@lsupcaemnt.com> wrote in message
news:V_CdnZpjXIZ71t_YnZ2dnUVZ_tmdnZ2d@comcast.com...
> awk -F, '$10 ~ /[0-9][0-9][0-9][0-9][0-9](-|[0-9])[0-9][0-9]/ { print $10 }' file |
> while read pat
> do
>     grep "$pat" /archives/* >/dev/null ||
>         printf "%s is not in any file\n" "$pat"
> done
That worked splendidly! Thanks Ed!