running gawk under win32 (haven't tested under *nix);
source file is a typical email text file w/ several messages in it.
(assume there is only 1 "^(From)" per message header - though "From"
may also be in the middle of a line)
I want to use the RE "^(From)" as my RS.
IF I use:
{
if ($0 ~ /^(From)/ )
print "\n---\n"$0"\n===\n";
}
as my gawk pattern, it DOES correctly print only the lines where From
is at the beginning of the line.
HOWEVER - if I use:
BEGIN { RS="^(From)"; }   # (also tried adding FS="\n";)
{
print "\n---\n"$0"\n===\n";
}
I EXPECT that each iteration of $0 SHOULD be a full message.
in fact, I only get 2 iterations; the 1st one is blank, the 2nd one
contains the entire rest of the file. (I'm working with a test file of
only 4 msgs - I'm sure this would die on large files)
I've done regex testing with egrep successfully, but can't seem to
replicate what I want using gawk.
I'm obviously missing a critical piece of logic here - I'm hoping that
not only could some kind soul show me the error of my ways, but fix the
"crick" in my logic/understanding of regexes.... (with regard to gawk)
tia - Bob

In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
Bob <nospam_nsh@starnetwx.net> wrote:
>running gawk under win32 (haven't tested under *nix);
>
>source file is a typical email text file w/ several messages in it.
>(assume there is only 1 "^(From)" per message header - though "From"
>may also be in the middle of a line)
>
>I want to use the RE "^(From)" as my RS.
I would suggest that you not do this, at least until you fully understand
the implications. That is, the implications of changing the standard
variables in general, and this one in particular.
I've always found it easier to leave the defaults as is (except FS, and
that only when the input data is very regular, such as the Unix password
file), and write my own routines to handle things.
Here's how I would do the email parse:
/^(From)/ { p() }
{ s = s $0 "\n" }
END { p() }
function p() {
if (!s) return
# ... do whatever with s ...
s = ""
}
I've used this idiom consistently over the years, and it is a good one.
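
For anyone who wants to try the idiom above as-is, here is a minimal runnable sketch. The sample mailbox is made up, emit() is a hypothetical stand-in for p() that just labels each collected message, and the pattern is tightened to /^From / on the assumption that separator lines are "From" followed by a space.

```sh
# Feed a made-up three-message mailbox to awk and collect the report.
out=$(printf 'From alice\nhello\nFrom bob\nsee you From here\nFrom carol\nbye\n' | awk '
/^From / { emit() }          # a new message starts: flush the previous one
         { s = s $0 "\n" }   # collect the current line
END      { emit() }          # flush the final message

function emit() {
    if (s == "") return      # nothing collected yet (first From line)
    n++
    printf "--- message %d ---\n%s", n, s
    s = ""
}
')

printf '%s\n' "$out"
```

Because every line is tested on its own, "^" here means "start of line", so the "From" in the middle of the second message body is left alone.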

On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:
>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob <nospam_nsh@starnetwx.net> wrote:
>
>I would suggest that you not do this, at least until you fully understand
>the implications. That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END { p() }
>
>function p() {
> if (!s) return
> ... do whatever with s ...
> s = ""
> }
>
>I've used this idiom consistently over the years, and it is a good one.
Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....
but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.
here's what I'm doing: (in this example - ^(stop) marks the end of
the headers)
BEGIN { c=1; }   # ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
# I've also tried substituting [ \t\f\n\r\v] for [:space:]
# and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.
here's another attempt that gets a little better:
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(rule)(.*)|[:space:]*incoming/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....

Sorry - I copied the wrong file on the last awk pattern.
On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:
>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob <nospam_nsh@starnetwx.net> wrote:
>
>I would suggest that you not do this, at least until you fully understand
>the implications. That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END { p() }
>
>function p() {
> if (!s) return
> ... do whatever with s ...
> s = ""
> }
>
>I've used this idiom consistently over the years, and it is a good one.
Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....
but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.
here's what I'm doing: (in this example - ^(stop) marks the end of
the headers)
BEGIN { c=1; }   # ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
# I've also tried substituting [ \t\f\n\r\v] for [:space:]
# and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.
here's another attempt that gets a little better: (but it's NOT using
"^")
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(From)(.*)|[:space:]*stop/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
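
A note on why the multi-line RE above cannot match: with the default RS, $0 only ever holds one line, so a pattern that has to span from a "From" line to a "stop" line never sees both at once. One possible way around that, sketched here under the assumption that "From" and "stop" each start a line as described above, is a per-line flag instead of one big RE. The variable name hdr and the sample input are made up for illustration.

```sh
# Print everything EXCEPT the lines from ^From through ^stop (inclusive).
out=$(printf 'From alice\nSubject: hi\nstop\nbody one\nFrom bob\nstop\nbody two\n' | awk '
/^From/ { hdr = 1 }   # header block begins on this line
!hdr    { print }     # outside a header block: keep the line
/^stop/ { hdr = 0 }   # header block ends after this line
')

printf '%s\n' "$out"
```

The order of the three rules matters: the "stop" line is still inside the header block when the print rule is tested, so it is skipped along with the rest of the headers.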

On Wed, 03 Mar 2004 14:24:41 -0600, Bob <nospam_nsh@starnetwx.net> wrote:
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
It is working correctly. A regex applies to the string it is presented
with - RS is presented with the *entire* file, so ^From matches *only*
the first occurrence in the file. Think about it: if RS is FROM, then
the concept of lines (records) in the usual sense simply doesn't exist.
I can almost visualize how it works: I would expect it to match only
From at the very beginning of the file - I suspect that since you got a
blank and everything else, the file begins with From.
I think you want a pattern consisting of \nFrom: wherever it occurs in
the file, but you may have to be explicit about \n - you may have to
encode it as octal escape sequences for both bytes in the Windows \n.
T.E.D. (tdavis@gearbox.maem.umr.edu)
SPAM filter: Messages to this address *must* contain "T.E.D."
somewhere in the body or they will be automatically rejected.
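
The difference described above can be seen on a small made-up mailbox. This is only a sketch: it assumes an awk that treats a multi-character RS as a regex (gawk does), and it uses "\nFrom " with a trailing space as the separator, which is an assumption about the mailbox format rather than something tested on Windows line endings. With gawk the anchored version is expected to report 2 records, the blank one plus the rest of the file, as Bob saw.

```sh
sample='From alice\nhello\nFrom bob\nsee you From here\nFrom carol\nbye\n'

# Anchored RS: "^" can only match at the very start of the whole input.
anchored=$(printf "$sample" | awk 'BEGIN { RS = "^From " } END { print NR }')

# Newline-prefixed RS: matches in front of every message except the first.
split=$(printf "$sample" | awk 'BEGIN { RS = "\nFrom " } { print NR ": " $1 }')

echo "records with RS=\"^From \": $anchored"
printf '%s\n' "$split"
```

Note that the separator text is consumed: the first record still begins with "From" (nothing precedes it to match the \n), while later records start at the sender.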

On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
<tdavis@gearbox.maem.umr.edu> wrote:
>
> I think you want a pattern consisting of \nFrom: wherever it occurs in
> the file, but you may have to be explicit about \n - you may have to
> encode it as octal escape sequences for both bytes in the Windows \n.
>
To match the beginning of messages in an mbox file, you want "\nFrom ".
--
Incrsease your earoning poswer and gaerner profwessional resspect.
Get the Un1iversity Dewgree you have already earned.
[from the prestigious, non-accredited University of Spam!]

In article <pphqh1-l83.ln1@don.localnet>,
Bill Marcum <bmarcum@iglou.com.urgent> wrote:
% On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
% <tdavis@gearbox.maem.umr.edu> wrote:
% >
% > I think you want a pattern consisting of \nFrom: wherever it occurs in
% > the file, but you may have to be explicit about \n - you may have to
% > encode it as octal escape sequences for both bytes in the Windows \n.
% >
%
% To match the beginning of messages in an mbox file, you want "\nFrom ".
More specifically, you want
\nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
although some applications screw up the date formatting -- I think the
best thing is to treat such mailboxes as corrupt, fix the format and
stop using such broken applications (and don't write any -- this is the
reason for my reply). Note that the first instance will not have a
preceding \n -- I would anchor this with ^ and $, but I don't see how
that can work in an RS.
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com
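
A rough sketch of that stricter pattern applied one line at a time, where anchoring with ^ and $ does work because each line is matched on its own. The {n} intervals are written out longhand here as a portability assumption (some awks needed a flag for intervals), and the sample lines and addresses are made up.

```sh
# Strict separator pattern, intervals expanded by hand.
re='^From [^ ]+ [A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$'

out=$(printf 'From bob@example.com Wed Mar  3 14:24:41 2004\nFrom the middle of a sentence\nFrom x@example.com Wed Mar 13 09:00:00 2004\n' |
  awk -v re="$re" '
    $0 ~ re { n++; print "separator: " $2 }   # line has the full date stamp
    END     { print n " separator lines" }
  ')

printf '%s\n' "$out"
```

The second sample line starts with "From " but lacks the date stamp, so it is not counted as a message boundary.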

Patrick TJ McPhee <ptjm@interlog.com> wrote:
> In article <pphqh1-l83.ln1@don.localnet>,
> Bill Marcum <bmarcum@iglou.com.urgent> wrote:
> % On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
> % <tdavis@gearbox.maem.umr.edu> wrote:
> % >
> % > I think you want a pattern consisting of \nFrom: wherever it occurs in
> % > the file, but you may have to be explicit about \n - you may have to
> % > encode it as octal escape sequences for both bytes in the Windows \n.
> % >
> %
> % To match the beginning of messages in an mbox file, you want "\nFrom ".
>
> More specifically, you want
>
> \nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
>
> although some applications screw up the date formatting -- I think the
> best thing is to treat such mailboxes as corrupt, fix the format and
> stop using such broken applications (and don't write any -- this is the
> reason for my reply). Note that the first instance will not have a
> preceding \n -- I would anchor this with ^ and $, but I don't see how
> that can work in an RS.
Or, use 'formail' :-)
--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution for data processing and document management.

On Wed, 03 Mar 2004 14:27:52 -0600, Bob <nospam_nsh@starnetwx.net> wrote:
[snip]
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
>
just wanted to thank all who responded!
I ended up going back to using the RS variable, and because I couldn't
use "^" (because it only matches once, at the start of the whole file),
I ended up making a really long (and nutty) RE to ensure I wasn't going
to match the "end phrase" when it really wasn't....
it turned out to be easier to do this, because I didn't have to write a
whole lot of logic to deal with a variable number of lines between the
"from" and "end"....
tx again - Bob