Code Comments

^(From) finds only the FIRST occurance when used in RS?? why?
running gawk under win32 (haven't tested under *nix);

source file is a typical email text file w/ several messages in it.
(assume there is only 1 "^(From)" per message header - though "From"
may also be in the middle of a line)

I want to use the RE "^(From)" as my RS.

IF I use:
{
if ($0 ~ /^(From)/ )
print "\n---\n"$0"\n===\n";
}

as my gawk pattern, it DOES correctly print only the lines where From
is at the beginning of a line.

HOWEVER - if I use:
BEGIN { RS="^(From)"; }  (tried adding  FS="\n"; )
{
print "\n---\n"$0"\n===\n";
}

I EXPECT that each iteration of $0 SHOULD be a full message.
in fact, I only get 2 iterations; the 1st one is blank, the 2nd one
contains the entire rest of the file. (I'm working with a test file of
only 4 msgs - I'm sure this would die on large files)

I've done regex testing with egrep successfully, but can't seem to
replicate what I want using gawk.

I'm obviously missing a critical piece of logic here - I'm hoping that
not only could some kind soul show me the err of my ways, but fix the
"crick" in my logic/understanding of regex's.... (with regard to gawk)

tia - Bob



Bob
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
Bob  <nospam_nsh@starnetwx.net> wrote:
>running gawk under win32 (haven't tested under *nix);
>
>source file is a typical email text file w/ several messages in it.
>(assume there is only 1 "^(From)" per message header - though "From"
>may also be in the middle of a line)
>
>I want to use the RE "^(From)" as my RS.

I would suggest that you not do this, at least until you fully understand
the implications.  That is, the implications of changing the standard
variables in general, and this one in particular.

I've always found it easier to leave the defaults as is (except FS, and
that only when the input data is very regular, such as the Unix password
file), and write my own routines to handle things.

Here's how I would do the email parse:

/^(From)/ { p() }
{ s = s $0 "\n" }
END	{ p() }

function p() {
	if (!s) return
	# ... do whatever with s ...
	s = ""
}

I've used this idiom consistently over the years, and it is a good one.
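
A runnable sketch of the idiom above, for anyone who wants to try it. The sample mailbox, the function names (sample_mbox, split_messages, emit) and the printed "--- message N ---" banner are made up for illustration; the pattern is written as /^From / (with the trailing space suggested later in this thread) rather than /^(From)/, which is an assumption about the mailbox format.

```sh
# Made-up two-message mailbox, used only for illustration.
sample_mbox() {
    printf '%s\n' \
        'From alice Wed Mar  3 10:00:00 2004' \
        'Subject: one' \
        '' \
        'Greetings From the first message.' \
        'From bob Wed Mar  3 11:00:00 2004' \
        'Subject: two' \
        '' \
        'Second body.'
}

split_messages() {
    awk '
    /^From / { emit() }         # a line starting with "From " closes the previous message
    { msg = msg $0 "\n" }       # append every line, the "From " line included
    END { emit() }              # emit the last message

    function emit() {
        if (msg == "") return   # nothing collected yet (first line of the file)
        printf "--- message %d ---\n%s", ++n, msg
        msg = ""
    }'
}

sample_mbox | split_messages
```

Because RS is left at its default, $0 is always a single line, so ^ means "start of line" here and the "From" in the middle of a body line is not treated as a separator.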


Kenny McCormack
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:

>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob  <nospam_nsh@starnetwx.net> wrote: 
>
>I would suggest that you not do this, at least until you fully understand
>the implications.  That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END	{ p() }
>
>function p() {
>	if (!s) return
>	... do whatever with s ...
>	s = ""
>	}
>
>I've used this idiom consistently over the years, and it is a good one.

Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....

but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.

here's what I'm doing:  (in this example - ^(stop) marks the end of
the headers)

BEGIN { c=1; }   //  ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
//  I've also tried substituting [ \t\f\n\r\v] for [:space:]
//  and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }

function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}


What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.

here's another attempt that gets a little better:

BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(rule)(.*)|[:space:]*incoming/ ) grep();
s=s$0;
c++
}
END { grep(); }

function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}


This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....



Bob
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
Sorry - I copied the wrong file on the last awk pattern.

On Wed, 03 Mar 2004 18:53:20 GMT, gazelle@yin.interaccess.com (Kenny
McCormack) wrote:

>In article <vo5c40lauk3rdufnvqdub34aoj2u988j3u@4ax.com>,
>Bob  <nospam_nsh@starnetwx.net> wrote: 
>
>I would suggest that you not do this, at least until you fully understand
>the implications.  That is, the implications of changing the standard
>variables in general, and this one in particular.
>
>I've always found it easier to leave the defaults as is (except FS, and
>that only when the input data is very regular, such as the Unix password
>file), and write my own routines to handle things.
>
>Here's how I would do the email parse:
>
>/^(From)/ { p() }
>{ s = s $0 "\n" }
>END	{ p() }
>
>function p() {
>	if (!s) return
>	... do whatever with s ...
>	s = ""
>	}
>
>I've used this idiom consistently over the years, and it is a good one.

Kenny;
tx very much for your reply -
your way is certainly worth pursuing, and since I can get it to work
a lot faster than my way.....

but - I'm still curious to know why my way isn't working...
in addition - even if I do use your way;
when I insert the "^" into the RE, things fall apart. I only get 1
"record" [string] printed, and it contains the whole file. apparently
the ^(From) only matches ONE match - I need to match every instance.

here's what I'm doing:  (in this example - ^(stop) marks the end of
the headers)

BEGIN { c=1; }   //  ('record' counter)
{
print "** "c" **";
if ($0 ~ /^(From)(.*[ :space:]*)*^(stop)/ ) grep();
//  I've also tried substituting [ \t\f\n\r\v] for [:space:]
//  and many other permutations for that matter.
s=s$0;
c++
}
END { grep(); }

function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}


What I REALLY want to do is "highlight" or extract all headers between
"^(From) and ^(stop)" - more to the point I really want everything
EXCEPT the headers - but I can easily NOT the RE after it works.

here's another attempt that gets a little better: (but it's NOT using
"^")

BEGIN { c=1; }
{
print "** "c" **";
if ($0 ~ /(From)(.*)|[:space:]*stop/ ) grep();
s=s$0;
c++
}
END { grep(); }

function grep() {
if (s) print "\n---\n"s"\n===\n";
s="";
}


This grabs the two lines including ^(From) & ^(stop), but obviously
nothing in between them....
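
A note on why the attempts above cannot match, plus a sketch of one alternative. With the default RS, $0 holds one line at a time, so a single regex that starts at ^(From) and ends at ^(stop) never sees both markers in the same string. Also, a character class has to be written [[:space:]] inside a bracket expression ([:space:] on its own is just a list of characters), and awk comments start with #, not //. A range pattern avoids the multi-line regex altogether. The sample input and the function names below are made up; "stop" stands in for whatever marks the end of the headers, as in the post above.

```sh
# Made-up input: two messages whose header block ends with a "stop" line.
sample_input() {
    printf '%s\n' \
        'From alice' 'Subject: one' 'stop' 'first body' \
        'From bob'   'Subject: two' 'stop' 'second body'
}

strip_headers() {
    awk '
    /^From /,/^stop/ { next }   # skip from a "From " line through the next "stop" line
    { print }                   # everything else is message body
    '
}

sample_input | strip_headers
```

Dropping the { next } rule and printing inside the range instead would give the headers rather than the bodies.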




Bob
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
On Wed, 03 Mar 2004 14:24:41 -0600, Bob <nospam_nsh@starnetwx.net>
wrote:


>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.

It is working correctly.  A regex applies to the string it is
presented - RS is presented with the *entire* file so ^From matches
*only* the first occurrence in the file.  Think about it: if RS is
From, then the concept of lines (records) in the usual sense simply
doesn't exist.

I can almost visualize how it works: I would expect it to match only
From at the very beginning of the file - I suspect that since you got
a blank and everything else, the file begins with From.

I think you want a pattern consisting of \nFrom: wherever it occurs in
the file, but you may have to be explicit about \n - you may have to
encode it as octal escape sequences for both bytes in the Windows \n.
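
A small sketch that counts records under both settings, to make the difference visible. It assumes an awk that treats a multi-character RS as a regex (gawk does; that is an extension, not something every awk offers). The three-message sample and the function names are made up for illustration.

```sh
# Made-up three-message mailbox, used only for illustration.
sample_mbox() {
    printf '%s\n' \
        'From alice Wed Mar  3 10:00:00 2004' 'Subject: one'   '' 'first body' \
        'From bob Wed Mar  3 11:00:00 2004'   'Subject: two'   '' 'second body' \
        'From carol Wed Mar  3 12:00:00 2004' 'Subject: three' '' 'third body'
}

# RS anchored with ^: the anchor refers to the start of the input, not of each line.
count_anchored() { awk 'BEGIN { RS = "^From" } END { print NR }'; }

# RS with an explicit newline: can match before every line that starts with "From ".
count_newline() { awk 'BEGIN { RS = "\nFrom " } END { print NR }'; }

sample_mbox | count_anchored
sample_mbox | count_newline
```

The second count should equal the number of messages. The first stays below it, which is consistent with the "one blank record, then the rest of the file" result reported at the top of the thread.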



T.E.D. (tdavis@gearbox.maem.umr.edu)
SPAM filter: Messages to this address *must* contain "T.E.D."
somewhere in the body or they will be automatically rejected.

Ted Davis
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
<tdavis@gearbox.maem.umr.edu> wrote:
>
> I think you want a pattern consisting of \nFrom: wherever it occurs in
> the file, but you may have to be explicit about \n - you may have to
> encode it as octal escape sequences for both bytes in the Windows \n.
>

To match the beginning of messages in an mbox file, you want "\nFrom ".
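
A minimal sketch of splitting on that separator, again assuming an awk with regex RS support such as gawk. Since the separator text is consumed, the "From " prefix is put back on every record after the first. The sample data, the function names and the banner line are made up for illustration.

```sh
# Made-up two-message mailbox, used only for illustration.
sample_mbox() {
    printf '%s\n' \
        'From alice Wed Mar  3 10:00:00 2004' 'Subject: one' '' 'first body' \
        'From bob Wed Mar  3 11:00:00 2004'   'Subject: two' '' 'second body'
}

split_on_rs() {
    awk 'BEGIN { RS = "\nFrom " }
    {
        # The separator is consumed, so records after the first have lost their "From ".
        rec = (NR > 1 ? "From " : "") $0
        sub(/\n$/, "", rec)     # the last record still carries the final newline
        printf "--- message %d ---\n%s\n", NR, rec
    }'
}

sample_mbox | split_on_rs
```

The first message keeps its own "From " because nothing precedes it, so there is no "\n" for the separator to match there.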

--
Incrsease your earoning poswer and gaerner profwessional resspect.
Get the Un1iversity Dewgree you have already earned.
[from the prestigious, non-accredited University of Spam!]

Bill Marcum
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
In article <pphqh1-l83.ln1@don.localnet>,
Bill Marcum  <bmarcum@iglou.com.urgent> wrote:
% On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
%   <tdavis@gearbox.maem.umr.edu> wrote:
% >
% > I think you want a pattern consisting of \nFrom: wherever it occurs in
% > the file, but you may have to be explicit about \n - you may have to
% > encode it as octal escape sequences for both bytes in the Windows \n.
% >
%
% To match the beginning of messages in an mbox file, you want "\nFrom ".

More specifically, you want

\nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n

although some applications screw up the date formatting -- I think the
best thing is to treat such mailboxes as corrupt, fix the format and
stop using such broken applications (and don't write any -- this is the
reason for my reply). Note that the first instance will not have a
preceding \n -- I would anchor this with ^ and $, but I don't see how
that can work in an RS.
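
One way around the anchoring problem is to leave RS alone and test each line, where ^ and $ do mean start and end of line. The sketch below does that with the same pattern; the {2} and {4} intervals are spelled out by hand because not every awk accepts interval expressions, and the sample lines and function name are made up for illustration.

```sh
classify_lines() {
    awk '
    BEGIN {
        d  = "[0-9]"
        # "From ", sender, weekday, month, day (space padded), hh:mm:ss, 4-digit year
        re = "^From [^ ]+ [A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-9]" d " " \
             d d ":" d d ":" d d " " d d d d "$"
    }
    $0 ~ re { print "separator: " $0; next }
    { print "ordinary:  " $0 }'
}

printf '%s\n' \
    'From alice@example.com Wed Mar  3 10:00:00 2004' \
    'From now on this is body text' \
    'From: alice@example.com' | classify_lines
```

Only the first sample line has the full "From sender date" shape; the other two start with "From" but are reported as ordinary lines.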
--

Patrick TJ McPhee
East York  Canada
ptjm@interlog.com

Patrick TJ McPhee
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
Patrick TJ McPhee <ptjm@interlog.com> wrote:
> In article <pphqh1-l83.ln1@don.localnet>,
> Bill Marcum  <bmarcum@iglou.com.urgent> wrote:
> % On Wed, 03 Mar 2004 15:44:18 -0600, Ted Davis
> %   <tdavis@gearbox.maem.umr.edu> wrote:
> % >
> % > I think you want a pattern consisting of \nFrom: wherever it occurs in
> % > the file, but you may have to be explicit about \n - you may have to
> % > encode it as octal escape sequences for both bytes in the Windows \n.
> % >
> %
> % To match the beginning of messages in an mbox file, you want "\nFrom ".
>
> More specifically, you want
>
>  \nFrom [^ ]+ [A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\n
>
> although some applications screw up the date formatting -- I think the
> best thing is to treat such mailboxes as corrupt, fix the format and
> stop using such broken applications (and don't write any -- this is the
> reason for my reply). Note that the first instance will not have a
> preceding \n -- I would anchor this with ^ and $, but I don't see how
> that can work in an RS.

Or, use 'formail' :-)

--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution for data processing and document management.

William Park
03-20-04 01:24 AM


Re: ^(From) finds only the FIRST occurance when used in RS?? why?
On Wed, 03 Mar 2004 14:27:52 -0600, Bob <nospam_nsh@starnetwx.net>
wrote:
 
[snip] 
>but - I'm still curious to know why my way isn't working...
>in addition - even if I do use your way;
>when I insert the "^" into the RE, things fall apart. I only get 1
>"record" [string] printed, and it contains the whole file. apparently
>the ^(From) only matches ONE match - I need to match every instance.
>

just wanted to thank all who responded!

I ended up going back to using the RS variable, and because I couldn't
use "^" (it is anchored to the start of the whole file, so it matches
only once), I ended up making a really long (and nutty) RE to ensure I
wasn't going to match the "end phrase" when it really wasn't one....

it turned out to be easier to do this, because I didn't have to write
a whole lot of logic to deal with a variable number of lines between
the "from" and "end"....

tx again - Bob



Bob
03-20-04 01:24 AM

