gawk RS and newlines
Ed Morton

2004-12-15, 8:55 pm

Say I have this file:

NUM 1
a
b
NUM 2
c
d
NUM 3
e
f

and I'd like to produce this output:

------
NUM 1
a
b
------
NUM 2
c
d
------
NUM 3
e
f

I can do the above by setting my RS, e.g.:

gawk -vRS="NUM " 'NR>1{printf "------\n%s%s",RS,$0}'

Now what if "NUM" can appear in the text? e.g. if my input file is:

NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f

then my script above will produce:

------
NUM 1
a
b
------
NUM 2
c
d ------
NUM nonesense
------
NUM 3
e
f

instead of what I really want:

------
NUM 1
a
b
------
NUM 2
c
d NUM nonesense
------
NUM 3
e
f

What I REALLY want is a way to specify that my RS is just NUM at the
start of a line, so I'd like to write:

gawk -vRS="^NUM " 'NR>1{printf "------\n%s%s",RS,$0}'

but that doesn't work:

------
^NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f

and neither does using a newline before NUM:

gawk -vRS="\nNUM " 'NR>1{printf "------\n%s%s",RS,$0}'

produces:

------

NUM 2
c
d NUM nonesense------

NUM 3
e
f

I can come up with a hack to get around this, but what's the right way
to do it? I'm using gawk 3.0.4.

Ed.

Kenny McCormack

2004-12-15, 8:55 pm

In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
....
>Now what if "NUM" can appear in the text? e.g. if my input file is:
>
>NUM 1
>a
>b
>NUM 2
>c
>d NUM nonesense
>NUM 3
>e
>f
>
>then my script above will produce:
>
>------
>NUM 1
>a
>b
>------
>NUM 2
>c
>d ------
>NUM nonesense
>------
>NUM 3
>e
>f


I don't think there is any "clean" way to do this. Whatever you come up
with will be a hack/workaround of one form or another. I played with it
a bit and this is what I found:

1) The problem with using a "^" is that it matches only at the
very beginning of the file.

2) The problem with leading with a "\n" is that it matches
everywhere you want it to *except* at the beginning of the file.
So, one idea is to set it on the cmd line to "NUM " and then
prepend a newline in the body, but this gets ugly/unclean real
fast.

My advice in cases like this is to leave RS alone, so that you're reading
the input a line at a time as you normally would, and then handle things
from there. I very rarely change RS.
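
The line-at-a-time approach described above could look something like the following. This is a minimal sketch, not code from the thread: it assumes every record starts with a line beginning "NUM ", and the names split_records, emit and rec are made up for illustration. RS keeps its default value and the records are assembled by hand.

```shell
# Sketch only: split_records is a hypothetical helper name, not from the thread.
# RS keeps its default value; a line starting with "NUM " opens a new record.
split_records() {
    awk '
    function emit() {                    # print the buffered record, if any
        if (rec != "") printf "------\n%s\n", rec
    }
    /^NUM / { emit(); rec = $0; next }   # header line: flush, then start a new record
    { rec = (rec == "" ? $0 : rec "\n" $0) }   # body line: append to the record
    END { emit() }                       # flush the final record
    ' "$@"
}

# Sample input from the original post, including the stray "NUM" mid-line.
printf 'NUM 1\na\nb\nNUM 2\nc\nd NUM nonesense\nNUM 3\ne\nf\n' | split_records
```

On the sample input this prints each of the three records under its own line of six dashes and leaves "d NUM nonesense" intact, because the pattern is anchored to the start of a line rather than relying on RS. Buffering is only worth it when the whole record is needed in one piece (here in rec) before it is printed.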

Ed Morton

2004-12-15, 8:55 pm



Kenny McCormack wrote:
> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
> ...
>
>
>
> I don't think there is any "clean" way to do this. Whatever you come up
> with will be a hack/workaround of one form or another. I played with it
> a bit and this is what I found:
>
> 1) The problem with using a "^" is that it matches only at the
> very beginning of the file.


Any idea why? I couldn't find any RS documentation that explained that.

> 2) The problem with leading with a "\n" is that it matches
> everywhere you want it to *except* at the beginning of the file.
> So, one idea is to set it on the cmd line to "NUM " and then
> prepend a newline in the body, but this gets ugly/unclean real
> fast.
>
> My advice in cases like this is to leave RS alone, so that you're reading
> the input a line at a time as you normally would, and then handle things
> from there. I very rarely change RS.
>


I occasionally use RS. I've come across this particular situation a
couple of times and just worked around it as you suggest. It's not a big
problem, but I'd like to understand exactly why "^<text>" doesn't work
as I expected.

Ed.
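
On the "why" question: the current gawk manual notes that in awk "^" and "$" anchor to the beginning and end of a string, not of a line, and that gawk views the input as one long string that happens to contain newlines, so it recommends avoiding anchors in RS. The following is a minimal sketch of that string-versus-line distinction only. It uses match() on a made-up two-line string rather than RS, and anchor_demo and s are hypothetical names.

```shell
# Sketch: "^" anchors to the start of the string, not to the start of each line.
# The string s is a made-up sample, not taken from the thread.
anchor_demo() {
    awk 'BEGIN {
        s = "a\nNUM 1"                  # "NUM" begins the second line of s
        print match(s, /^NUM/)          # no match: "NUM" is not at the start of s
        print match(s, /(^|\n)NUM/)     # matches the newline plus "NUM"
    }'
}

anchor_demo
```

This prints 0 and then 2: the anchored pattern finds nothing, while the alternation matches starting at the newline in position 2. That is also why a newline-based pattern has to consume the newline as part of the match.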

Ed Morton

2004-12-17, 8:55 am



Stepan Kasal wrote:

> Hi,
>
>
>
>
> I'd prefer setting RS="\n"rs and then match("^"rs,$0) for NR==1.


So, something like this would work if we KNOW that each record,
including the first, does start with "NUM ":

gawk 'BEGIN{rs="NUM ";RS="\n"rs}NR==1{sub(rs,"")}
{printf"----\n%s%s\n",rs,$0}'

Not bad, though you'd have to set RT manually for the first line if you
want to use it in the main body.

Ed.

> Using "^" in RS can be dangerous.
> See also the bug report posted a minute ago.
>
> Have a nice day,
> Stepan
>

Stepan Kasal

2004-12-17, 3:56 pm

[I'll mail this to bug-gawk.]

Hello,
I've noticed a problem with "^" in RS in gawk. In most cases, it seems
to match only the beginning of the file. But in fact it matches the
beginning of gawk's internal buffer.

Observe the following example:

$ gawk 'BEGIN{for(i=1;i<=100;i++) print "Axxxxxx"}' >file
$ gawk 'BEGIN{RS="^A"} END{print NR}' file
2
$ gawk 'BEGIN{RS="^Ax*\n"} END{print NR}' file
100
$ head file | gawk 'BEGIN{RS="^Ax*\n"} END{print NR}'
10
$

I think this calls for some clarification/fix. But I don't have a
fixed opinion on what the solution should look like.

Have a nice day,
Stepan Kasal

Janis Papanagnou

2004-12-18, 12:47 pm

Ed Morton wrote:
>
> Now what if "NUM" can appear in the text? e.g. if my input file is:
>
> NUM 1
> a
> b
> NUM 2
> c
> d NUM nonesense
> NUM 3
> e
> f


awk '/^NUM/ { print "------" } 1'

will do the job.

> What I REALLY want is a way to specify that my RS is just NUM at the
> start of a line, so I'd like to write:


Any reason to use RS?

Janis

Ed Morton

2004-12-18, 12:47 pm



John DuBois wrote:

> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
> ...
>
>
> ...
>
>
>
> This is only the 'right way' if you have some other reason to be splitting into
> records in this particular way, but in any case you could do:
>
> gawk -vRS='(^|\n)NUM ' '{sub("NUM","------\nNUM",RT); printf "%s%s", $0, RT}'
>
> John


I guess another variation on that theme would be:

gawk 'BEGIN{rs="NUM ";RS="(^|\n)"rs}NR==1{next}
{printf"----\n%s%s\n",rs,$0}'

It's slightly more typing, but you don't need the "sub" for every record
and the actual RS text ("NUM ") is only specified in one place instead
of 3 so it's easy to change later.

Thanks for the tip,

Ed.

Copyright 2008 codecomments.com