^(From) finds only the FIRST occurrence when used in RS?? why?
| running gawk under win32 (haven't tested under *nix);
source file is a typical email text file w/ several messages in it.
(assume there is only 1 "^(From)" per message header - though "From"
may also be in the middle of a line)
I want to use the RE "^(From)" as my RS.
IF I use:
{
if ($0 ~ /^(From)/ )
print "\n---\n"$0"\n===\n";
}
as my gawk pattern, it DOES correctly print only the lines where From
is at the beginning of a line.
HOWEVER - if I use:
BEGIN { RS="^(From)"; }  # (also tried adding FS="\n"; )
{
print "\n---\n"$0"\n===\n";
}
I EXPECT that each iteration of $0 SHOULD be a full message.
in fact, I only get 2 iterations; the 1st one is blank, the 2nd one
contains the entire rest of the file. (I'm working with a test file of
only 4 msgs - I'm sure this would die on large files)
I've done regex testing with egrep successfully, but can't seem to
replicate what I want using gawk.
I'm obviously missing a critical piece of logic here - I'm hoping that
not only could some kind soul show me the error of my ways, but fix the
"crick" in my logic/understanding of regexes.... (with regard to gawk)
tia - Bob
| |
| Kenny McCormack 2004-03-19, 8:24 pm |
| In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
Bob <nospam_nsh@starnetwx.net> wrote:
>running gawk under win32 (haven't tested under *nix);
>
>source file is a typical email text file w/ several messages in it.
>(assume there is only 1 "^(From)" per message header - though "From"
>may also be in the middle of a line)
>
>I want to use the RE "^(From)" as my RS.
I would suggest that you not do this, at least until you fully understand
the implications. That is, the implications of changing the standard
variables in general, and this one in particular.
I've always found it easier to leave the defaults as is (except FS, and
that only when the input data is very regular, such as the Unix password
file), and write my own routines to handle things.
Here's how I would do the email parse:
/^(From)/ { p() }
{ s = s $0 "\n" }
END { p() }
function p() {
if (!s) return
# ... do whatever with s ...
s = ""
}
I've used this idiom consistently over the years, and it is a good one.
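A minimal runnable sketch of the accumulate-and-flush idiom above, wrapped in a shell function so it can be tried on a small sample. The names split_messages and emit, the sample addresses and the "--- message N ---" banner are illustrative assumptions, not part of the original post. The pattern is tightened to /^From / (with a trailing space), which is how mbox separator lines are usually written.

```shell
# Sketch only: split an mbox-style stream into messages without touching RS.
split_messages() {
  awk '
    /^From / { emit() }              # a new separator line: flush the previous message
             { msg = msg $0 "\n" }   # every line (separator included) joins the current message
    END      { emit() }              # flush the last message
    function emit() {
      if (msg == "") return          # nothing collected yet (first separator)
      n++
      printf "--- message %d ---\n%s", n, msg
      msg = ""
    }
  '
}

# Hypothetical two-message sample; "From" in mid-line must not split.
sample='From alice@example.com Wed Mar  3 12:00:00 2004
Subject: one

A reply From me, mid-line.
From bob@example.com Wed Mar  3 12:05:00 2004
Subject: two

second body
'

printf '%s' "$sample" | split_messages
```

Because RS is left at its default, ^ is tested against each line in turn, so every separator line is found and a mid-line "From" is ignored.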
| |
|
| On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:
>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob <nospam_nsh@starnetwx.net> wrote:
>
>I would suggest that you not do this, at least until you fully understand
>the implications. That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END { p() }
>
>function p() {
> if (!s) return
> ... do whatever with s ...
> s = ""
> }
>
>I've used this idiom consistently over the years, and it is a good one.
Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....
but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.
here's what I'm doing: (in this example - ^(stop) marks the end of
the headers)
BEGIN { c=1; }  # ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
# I've also tried substituting [ \t\f\n\r\v] for [:space:]
# and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.
here's another attempt that gets a little better:
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(rule)(.*)|[:space:]*incoming/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
| |
|
| Sorry - I copied the wrong file on the last awk pattern.
here's another attempt that gets a little better (but it's NOT using
"^"):
BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(From)(.*)|[:space:]*stop/ ) grep();
s=s$0;
c++
}
END { grep(); }
function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}
This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
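A regex tested against $0 cannot reach from the From line to the stop line, because with the default RS each $0 holds exactly one line. One way to get everything EXCEPT the headers is a state flag that is switched on at the From line and off at the stop line. A minimal sketch, assuming (as in the post) that a line starting with "stop" ends the headers; the names strip_headers and in_header and the sample text are made up for illustration.

```shell
# Sketch only: print every line that is not between a From line and a stop line.
strip_headers() {
  awk '
    /^From /   { in_header = 1 }   # separator line opens a header block
    !in_header { print }           # outside a header block: keep the line
    /^stop/    { in_header = 0 }   # stop line closes the block (and is itself dropped)
  '
}

# Hypothetical sample using the "stop" marker from the post.
sample='From alice@example.com Wed Mar  3 12:00:00 2004
Subject: one
stop
body one
From bob@example.com Wed Mar  3 12:05:00 2004
stop
body two
'

printf '%s' "$sample" | strip_headers
```

The order of the three rules matters: the stop rule runs after the print rule, so the stop line is still treated as part of the headers.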
| |
| Ted Davis 2004-03-19, 8:24 pm |
| On Wed, 03 Mar 2004 14:24:41 -0600, Bob <nospam_nsh@starnetwx.net>
wrote:
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
It is working correctly. A regex applies to the string it is
presented - RS is presented with the *entire* file so ^From matches
*only* the first occurrence in the file. Think about it: if RS is
From, then the concept of lines (records) in the usual sense simply
doesn't exist.
I can almost visualize how it works: I would expect it to match only
From at the very beginning of the file - I suspect that since you got
a blank and everything else, the file begins with From.
I think you want a pattern consisting of \nFrom: wherever it occurs in
the file, but you may have to be explicit about \n - you may have to
encode it as octal escape sequences for both bytes in the Windows \n.
T.E.D. (tdavis@gearbox.maem.umr.edu)
SPAM filter: Messages to this address *must* contain "T.E.D."
somewhere in the body or they will be automatically rejected.
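To illustrate the point about RS: with a multi-character RS the whole input is treated as one long string, so the newline has to be written into the separator explicitly. A minimal sketch, assuming an awk that accepts a regex RS (gawk and mawk do; POSIX only defines a single-character RS) and Unix line endings; for Windows files something like "\r?\nFrom " may be needed. The name count_messages and the sample are hypothetical.

```shell
# Sketch only: use "\nFrom " as the record separator and show each record's first line.
count_messages() {
  awk '
    BEGIN { RS = "\nFrom " }       # newline spelled out instead of the ^ anchor
    {
      split($0, lines, "\n")       # first line of the record identifies the message
      print NR ": " lines[1]
    }
  '
}

# Hypothetical two-message sample; "From" in mid-line must not split.
sample='From alice@example.com Wed Mar  3 12:00:00 2004
Subject: one

A reply From me, mid-line.
From bob@example.com Wed Mar  3 12:05:00 2004
Subject: two

second body
'

printf '%s' "$sample" | count_messages
```

Note that the first record keeps its leading "From " (there is no newline in front of it to match), while later records lose it, since the separator text is consumed.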
| |
| Bill Marcum 2004-03-19, 8:24 pm |
| On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
<tdavis@gearbox.maem.umr.edu> wrote:
>
> I think you want a pattern consisting of \nFrom: wherever it occurs in
> the file, but you may have to be explicit about \n - you may have to
> encode it as octal escape sequences for both bytes in the Windows \n.
>
To match the beginning of messages in an mbox file, you want "\nFrom ".
--
Incrsease your earoning poswer and gaerner profwessional resspect.
Get the Un1iversity Dewgree you have already earned.
[from the prestigious, non-accredited University of Spam!]
| |
| Patrick TJ McPhee 2004-03-19, 8:24 pm |
| In article <pphqh1-l83.ln1@don.localnet>,
Bill Marcum <bmarcum@iglou.com.urgent> wrote:
% On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
% <tdavis@gearbox.maem.umr.edu> wrote:
% >
% > I think you want a pattern consisting of \nFrom: wherever it occurs in
% > the file, but you may have to be explicit about \n - you may have to
% > encode it as octal escape sequences for both bytes in the Windows \n.
% >
%
% To match the beginning of messages in an mbox file, you want "\nFrom ".
More specifically, you want
\nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
although some applications screw up the date formatting -- I think the
best thing is to treat such mailboxes as corrupt, fix the format and
stop using such broken applications (and don't write any -- this is the
reason for my reply). Note that the first instance will not have a
preceding \n -- I would anchor this with ^ and $, but I don't see how
that can work in an RS.
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com
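Since ^ and $ do anchor to each line when a pattern is tested against an ordinary one-line record, the stricter expression above can be used as a per-line test instead of as RS. A minimal sketch, with the {2} and {4} intervals written out longhand on the assumption that not every awk enables interval expressions by default (worth checking for your awk); count_separators and the sample lines are hypothetical.

```shell
# Sketch only: count lines that look like full mbox separator lines.
count_separators() {
  awk '
    # From <sender> <Day> <Mon> <dd> <hh:mm:ss> <yyyy>, anchored at both ends
    /^From [^ ]+ [A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$/ { n++ }
    END { print n + 0 }            # n + 0 prints 0 rather than an empty string
  '
}

# Hypothetical sample: one real separator, one body line that merely starts with "From ".
sample='From alice@example.com Wed Mar  3 12:00:00 2004
Subject: one

From the look of it, this line is body text.
'

printf '%s' "$sample" | count_separators
```

A test like this could stand in for the looser /^From / pattern in the accumulate-and-flush approach when body lines beginning with "From " are a concern.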
| |
| William Park 2004-03-19, 8:24 pm |
| Patrick TJ McPhee <ptjm@interlog.com> wrote:
> In article <pphqh1-l83.ln1@don.localnet>,
> Bill Marcum <bmarcum@iglou.com.urgent> wrote:
> % On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
> % <tdavis@gearbox.maem.umr.edu> wrote:
> % >
> % > I think you want a pattern consisting of \nFrom: wherever it occurs in
> % > the file, but you may have to be explicit about \n - you may have to
> % > encode it as octal escape sequences for both bytes in the Windows \n.
> % >
> %
> % To match the beginning of messages in an mbox file, you want "\nFrom ".
>
> More specifically, you want
>
> \nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
>
> although some applications screw up the date formatting -- I think the
> best thing is to treat such mailboxes as corrupt, fix the format and
> stop using such broken applications (and don't write any -- this is the
> reason for my reply). Note that the first instance will not have a
> preceding \n -- I would anchor this with ^ and $, but I don't see how
> that can work in an RS.
Or, use 'formail' :-)
--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution for data processing and document management.
| |
|
| On Wed, 03 Mar 2004 14:27:52 -0600, Bob <nospam_nsh@starnetwx.net>
wrote:
[snip]
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
>
just wanted to thank all who responded!
I ended up going back to using the RS variable, and because I couldn't
use "^" (because it only matches once, at the start of the whole file), I ended up
making a really long (and nutty) RE to ensure I wasn't going to match
the "end phrase" when it really wasn't....
it turned out to be easier to do this, because I didn't have to write
a whole lot of logic to deal with a variable number of lines between
the "from" and "end"....
tx again - Bob