gawk RS and newlines
| Ed Morton 2004-12-15, 8:55 pm |
| Say I have this file:
NUM 1
a
b
NUM 2
c
d
NUM 3
e
f
and I'd like to produce this output:
------
NUM 1
a
b
------
NUM 2
c
d
------
NUM 3
e
f
I can do the above by setting my RS, e.g.:
gawk -vRS="NUM " 'NR>1{printf "------\n%s%s",RS,$0}'
Now what if "NUM" can appear in the text? e.g. if my input file is:
NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f
then my script above will produce:
------
NUM 1
a
b
------
NUM 2
c
d ------
NUM nonesense
------
NUM 3
e
f
instead of what I really want:
------
NUM 1
a
b
------
NUM 2
c
d NUM nonesense
------
NUM 3
e
f
What I REALLY want is a way to specify that my RS is just NUM at the
start of a line, so I'd like to write:
gawk -vRS="^NUM " 'NR>1{printf "------\n%s%s",RS,$0}'
but that doesn't work:
------
^NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f
and neither does using a newline before NUM:
gawk -vRS="\nNUM " 'NR>1{printf "------\n%s%s",RS,$0}'
produces:
------
NUM 2
c
d NUM nonesense------
NUM 3
e
f
I can come up with a hack to get around this, but what's the right way
to do it? I'm using gawk 3.0.4.
Ed.
| |
| Kenny McCormack 2004-12-15, 8:55 pm |
| In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
....
>Now what if "NUM" can appear in the text? e.g. if my input file is:
>
>NUM 1
>a
>b
>NUM 2
>c
>d NUM nonesense
>NUM 3
>e
>f
>
>then my script above will produce:
>
>------
>NUM 1
>a
>b
>------
>NUM 2
>c
>d ------
>NUM nonesense
>------
>NUM 3
>e
>f
I don't think there is any "clean" way to do this. Whatever you come up
with will be a hack/workaround of one form or another. I played with it
a bit and this is what I found:
1) The problem with using a "^" is that it matches only at the
very beginning of the file.
2) The problem with leading with a "\n" is that it matches
everywhere you want it to *except* at the beginning of the file.
So, one idea is to set it on the cmd line to "NUM " and then
prepend a newline in the body, but this gets ugly/unclean real
fast.
My advice in cases like this is to leave RS alone, so that you're reading
the input a line at a time like you normally expect, and then handle things
from there. I very rarely change RS.
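A minimal sketch of that line-at-a-time approach, assuming any POSIX awk and that every record starts on a line beginning with "NUM ". The helper name split_on_num and the variable rec are made up for illustration; they do not come from the thread.

```shell
# Sketch only: split_on_num is a hypothetical helper name, and the sample
# data at the bottom is copied from the question.
split_on_num() {
  awk '
    # A line that starts with "NUM " opens a new record: emit the old one.
    /^NUM / { if (rec != "") emit() }
    # Every line, header or not, is appended to the current record.
    { rec = rec $0 "\n" }
    # Emit whatever is still buffered at end of input.
    END { if (rec != "") emit() }
    # Print one complete record behind a separator, then reset the buffer.
    function emit() { printf "------\n%s", rec; rec = "" }
  '
}

printf 'NUM 1\na\nb\nNUM 2\nc\nd NUM nonesense\nNUM 3\ne\nf\n' | split_on_num
```

Because the record is collected in rec before it is printed, the emit function is the place to do any per-record processing, which is what changing RS was meant to provide.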
| |
| Ed Morton 2004-12-15, 8:55 pm |
|
Kenny McCormack wrote:
> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
> ...
>
>
>
> I don't think there is any "clean" way to do this. Whatever you come up
> with will be a hack/workaround of one form or another. I played with it
> a bit and this is what I found:
>
> 1) The problem with using a "^" is that it matches only at the
> very beginning of the file.
Any idea why? I couldn't find any RS documentation that explained that.
> 2) The problem with leading with a "\n" is that it matches
> everywhere you want it to *except* at the beginning of the file.
> So, one idea is to set it on the cmd line to "NUM " and then
> prepend a newline in the body, but this gets ugly/unclean real
> fast.
>
> My advice in cases like this is to leave RS alone, so that you're reading
> the input a line at a time like you normally expect, and then handle things
> from there. I very rarely change RS.
>
I occasionally use RS. This particular situation I've come across a
couple of times and just worked around it as you suggest. It's not a big
problem, but I'd like to understand exactly why "^<text>" doesn't work
as I expected.
Ed.
| |
| Ed Morton 2004-12-17, 8:55 am |
|
Stepan Kasal wrote:
> Hi,
>
>
>
>
> I'd prefer setting RS="\n"rs and then match("^"rs,$0) for NR==1.
So, something like this would work if we KNOW that each record,
including the first, does start with "NUM ":
gawk 'BEGIN{rs="NUM ";RS="\n"rs}NR==1{sub(rs,"")}
{printf"----\n%s%s\n",rs,$0}'
Not bad, though you'd have to set RT manually for the first line if you
want to use it in the main body.
Ed.
> Using "^" in RS can be dangerous.
> See also the bug report posted a minute ago.
>
> Have a nice day,
> Stepan
>
| |
| Stepan Kasal 2004-12-17, 3:56 pm |
| [I'll mail this to bug-gawk.]
Hello,
I've noticed a problem with "^" in RS in gawk. In most cases, it seems
to match only the beginning of the file. But in fact it matches the
beginning of gawk's internal buffer.
Observe the following example:
$ gawk 'BEGIN{for(i=1;i<=100;i++) print "Axxxxxx"}' >file
$ gawk 'BEGIN{RS="^A"} END{print NR}' file
2
$ gawk 'BEGIN{RS="^Ax*\n"} END{print NR}' file
100
$ head file | gawk 'BEGIN{RS="^Ax*\n"} END{print NR}'
10
$
I think this calls for some clarification/fix. But I don't have any
fixed opinion on what the solution should look like.
Have a nice day,
Stepan Kasal
| |
| Janis Papanagnou 2004-12-18, 12:47 pm |
| Ed Morton wrote:
>
> Now what if "NUM" can appear in the text? e.g. if my input file is:
>
> NUM 1
> a
> b
> NUM 2
> c
> d NUM nonesense
> NUM 3
> e
> f
awk '/^NUM/ { print "------" } 1'
will do the job.
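A quick check of that one-liner against the sample input, assuming any POSIX awk. The variable name sample is only for illustration.

```shell
# Sketch: "sample" is an illustrative variable holding the input from the question.
sample='NUM 1
a
b
NUM 2
c
d NUM nonesense
NUM 3
e
f'

# /^NUM/ is tested against each input line separately, so ^ anchors at the
# start of every line; the trailing 1 is an always-true pattern whose
# default action prints the line unchanged.
printf '%s\n' "$sample" | awk '/^NUM/ { print "------" } 1'
```

With RS left at its default, the "NUM" inside "d NUM nonesense" is never at the start of a line, so no separator is printed in front of it.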
> What I REALLY want is a way to specify that my RS is just NUM at the
> start of a line, so I'd like to write:
Is there any reason to use RS?
Janis
| |
| Ed Morton 2004-12-18, 12:47 pm |
|
John DuBois wrote:
> In article <cpkt7r$ct0@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
> ...
>
>
> ...
>
>
>
> This is only the 'right way' if you have some other reason to be splitting into
> records in this particular way, but in any case you could do:
>
> gawk -vRS='(^|\n)NUM ' '{sub("NUM","------\nNUM",RT); printf "%s%s", $0, RT}'
>
> John
I guess another variation on that theme would be:
gawk 'BEGIN{rs="NUM ";RS="(^|\n)"rs}NR==1{next}
{printf"----\n%s%s\n",rs,$0}'
It's slightly more typing, but you don't need the "sub" for every record
and the actual RS text ("NUM ") is only specified in one place instead
of 3 so it's easy to change later.
Thanks for the tip,
Ed.