Code Comments

gawk RS and newlines
Say I have this file:

NUM 1
a
b
NUM 2
c
d
NUM 3
e
f

and I'd like to produce this output:

------
NUM  1
a
b
------
NUM  2
c
d
------
NUM  3
e
f

I can do the above by setting my RS, e.g.:

gawk -vRS="NUM " 'NR>1{printf "------\n%s %s",RS,$0}'

Now what if "NUM" can appear in the text? e.g. if my input file is:

NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f

then my script above will produce:

------
NUM  1
a
b
------
NUM  2
c
d ------
NUM  nonesense
------
NUM  3
e
f

instead of what I really want:

------
NUM  1
a
b
------
NUM  2
c
d NUM nonesense
------
NUM  3
e
f

What I REALLY want is a way to specify that my RS is just NUM at the
start of a line, so I'd like to write:

gawk -vRS="^NUM " 'NR>1{printf "------\n%s %s",RS,$0}'

but that doesn't work:

------
^NUM  1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f

and neither does using a newline before NUM:

gawk -vRS="\nNUM " 'NR>1{printf "------\n%s %s",RS,$0}'

produces:

------

NUM  2
c
d NUM nonesense------

NUM  3
e
f

I can come up with a hack to get around this, but what's the right way
to do it? I'm using gawk 3.0.4.

Ed.

Ed Morton
12-16-04 01:55 AM


Re: gawk RS and newlines
In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
Ed Morton  <morton@lsupcaemnt.com> wrote:
...
>Now what if "NUM" can appear in the text? e.g. if my input file is:
>
>NUM 1
>a
>b
>NUM 2
>c
>d NUM nonesense
>NUM 3
>e
>f
>
>then my script above will produce:
>
>------
>NUM  1
>a
>b
>------
>NUM  2
>c
>d ------
>NUM  nonesense
>------
>NUM  3
>e
>f

I don't think there is any "clean" way to do this.  Whatever you come up
with will be a hack/workaround of one form or another.  I played with it
a bit and this is what I found:

1) The problem with using a "^" is that that matches only at the
very beginning of the file.

2) The problem with leading with a "\n" is that it matches
everywhere you want it to *except* at the beginning of the file.
So, one idea is to set it on the cmd line to "NUM " and then
prepend a newline in the body, but this gets ugly/unclean real
fast.

My advice in cases like this is to leave RS alone, so that you're reading
the input a line at a time as you normally would, and then handle things
from there.  I very rarely change RS.


Kenny McCormack
12-16-04 01:55 AM
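
A minimal sketch of the line-at-a-time approach suggested above, assuming every record starts with a line beginning "NUM " (as in the original question). The names split_on_num and emit are made up for illustration. RS is left at its default, so a "NUM" in the middle of a line is just ordinary text.

```shell
# Hypothetical helper: print each "NUM " block preceded by a separator,
# reading the input one line at a time (RS keeps its default value).
split_on_num() {
    awk '
        # Hypothetical helper function: output one complete record.
        function emit() {
            print "------"
            print rec
        }

        /^NUM / {                     # a header line starts a new record
            if (have) emit()          # finish the previous record first
            rec = $0; have = 1
            next
        }
        have { rec = rec "\n" $0 }    # body line: append to the current record
        END  { if (have) emit() }     # do not forget the last record
    ' "$@"
}

# Sample input taken from the question, including the mid-line "NUM".
printf 'NUM 1\na\nb\nNUM 2\nc\nd NUM nonesense\nNUM 3\ne\nf\n' | split_on_num
```

Because the whole record is collected in rec before it is printed, any per-record processing can go in emit(), which is roughly what changing RS would have provided.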


Re: gawk RS and newlines

Kenny McCormack wrote:
> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton  <morton@lsupcaemnt.com> wrote:
> ...
> 
>
>
> I don't think there is any "clean" way to do this.  Whatever you come up
> with will be a hack/workaround of one form or another.  I played with it
> a bit and this is what I found:
>
> 	1) The problem with using a "^" is that that matches only at the
> 	   very beginning of the file.

Any idea why? I couldn't find any RS documentation that explained that.

> 	2) The problem with leading with a "\n" is that it matches
> 	   everywhere you want it to *except* at the beginning of the file.
> 	   So, one idea is to set it on the cmd line to "NUM " and then
> 	   prepend a newline in the body, but this gets ugly/unclean real
> 	   fast.
>
> My advice in cases like this is to leave RS alone, so that you're reading
> the input a line at a time as you normally would, and then handle things
> from there.  I very rarely change RS.
>

I occasionally use RS. This particular situation I've come across a
couple of times and just worked around it as you suggest. It's not a big
problem, but I'd like to understand exactly why "^<text>" doesn't work
as I expected.

Ed.

Ed Morton
12-16-04 01:55 AM
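
On the "^" question: part of the answer is probably that awk regular expressions have no multi-line mode, so "^" anchors at the start of the string being matched and never after an embedded newline. What gawk treats as "the string" while it scans for RS is an implementation detail that this sketch does not cover. It only shows the anchoring rule, using match() on a two-line string; anchor_demo is a made-up name.

```shell
# Hypothetical demo: where do "^NUM" and "\nNUM" match inside one
# string that contains an embedded newline?
anchor_demo() {
    awk 'BEGIN {
        s = "x\nNUM 1"
        # match() returns the 1-based start of the match, or 0 if none.
        # "^NUM" fails because "^" only anchors at the start of s;
        # "\nNUM" succeeds at the newline, which is character 2.
        print match(s, /^NUM/), match(s, /\nNUM/)
    }'
}

anchor_demo    # prints: 0 2
```

This fits the observation earlier in the thread that a leading "\n" in RS matches everywhere except at the very start of the file, where there is no preceding newline to match.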


Re: gawk RS and newlines

Stepan Kasal wrote:

> Hi,
>
> 
>
>
> I'd prefer setting RS="\n"rs and then match($0,"^"rs) for NR==1.

So, something like this would work if we KNOW that each record,
including the first, does start with "NUM ":

gawk 'BEGIN{rs="NUM ";RS="\n"rs}NR==1{sub(rs,"")}
{printf"----\n%s%s\n",rs,$0}'

Not bad, though you'd have to set RT manually for the first line if you
want to use it in the main body.

Ed.

> Using "^" in RS can be dangerous.
> See also the bug report posted a minute ago.
>
> Have a nice day,
>         Stepan
>

Ed Morton
12-17-04 01:55 PM


Re: gawk RS and newlines
[I'll mail this to bug-gawk.]

Hello,
I've noticed a problem with "^" in RS in gawk.  In most cases, it seems
to match only the beginning of the file.  But in fact it matches the
beginning of gawk's internal buffer.

Observe the following example:

$ gawk 'BEGIN{for(i=1;i<=100;i++) print "Axxxxxx"}' >file
$ gawk 'BEGIN{RS="^A"} END{print NR}' file
2
$ gawk 'BEGIN{RS="^Ax*\n"} END{print NR}' file
100
$ head file | gawk 'BEGIN{RS="^Ax*\n"} END{print NR}'
10
$

I think this calls for some clarification/fix.  But I don't have a
fixed opinion on what the solution should look like.

Have a nice day,
Stepan Kasal

Stepan Kasal
12-17-04 08:56 PM


Re: gawk RS and newlines
Ed Morton wrote:
>
> Now what if "NUM" can appear in the text? e.g. if my input file is:
>
> NUM 1
> a
> b
> NUM 2
> c
> d NUM nonesense
> NUM 3
> e
> f

awk '/^NUM/ { print "------" } 1'

will do the job.

> What I REALLY want is a way to specify that my RS is just NUM at the
> start of a line, so I'd like to write:

Any reason to use RS?

Janis

Janis Papanagnou
12-18-04 05:47 PM


Re: gawk RS and newlines

John DuBois wrote:

> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton  <morton@lsupcaemnt.com> wrote:
> ...
> 
>
> ...
> 
>
>
> This is only the 'right way' if you have some other reason to be splitting into
> records in this particular way, but in any case you could do:
>
> gawk -vRS='(^|\n)NUM ' '{sub("NUM","------\nNUM",RT); printf "%s%s", $0, RT}'
>
> 	John

I guess another variation on that theme would be:

gawk 'BEGIN{rs="NUM ";RS="(^|\n)"rs}NR==1{next}
{printf"----\n%s%s\n",rs,$0}'

It's slightly more typing, but you don't need the "sub" for every record
and the actual RS text ("NUM ") is only specified in one place instead
of 3 so it's easy to change later.

Thanks for the tip,

Ed.

Ed Morton
12-18-04 05:47 PM


Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.