
 

Author Joining lines
eldorado

2004-07-29, 3:55 pm

Is there an awk one-liner that would join two lines, only if the second line
has a space at the beginning? i.e.:

1234
 56
789
9876
1

would look like
123456
789
9876
1


--
Randomly generated signature --
We are coming after you. God may have mercy on you, but we won't -- John McCain

Ed Morton

2004-07-29, 3:55 pm



eldorado wrote:

> Is there an awk one-liner that would join two lines, only if the second line
> has a space at the beginning? i.e.:
>
> 1234
>  56
> 789
> 9876
> 1
>
> would look like
> 123456
> 789
> 9876
> 1
>
>


Something like this (untested):

gawk 'BEGIN{RS="";FS="\n ";OFS=""}$1=$1'

Regards,

Ed.
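
For comparison, the same join can also be done line by line, without paragraph mode. This is only a minimal sketch and not the approach posted above: the function name join_continuations and the variable buf are made up for illustration, and it assumes a continuation line is marked by exactly one leading space, which is dropped.

```shell
# Hypothetical helper: append any line that starts with a space to the line before it.
join_continuations() {
    awk '
        /^ /   { buf = buf substr($0, 2); next }  # continuation: drop the leading space and append
        NR > 1 { print buf }                      # a new logical line begins: flush the previous one
               { buf = $0 }                       # start collecting the new logical line
        END    { if (NR) print buf }              # flush the last logical line
    '
}

joined=$(printf '1234\n 56\n789\n9876\n1\n' | join_continuations)
printf '%s\n' "$joined"
```

Because RS is never changed, this version passes blank lines through as ordinary lines instead of treating them as record separators.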

Kenny McCormack

2004-07-29, 3:55 pm

In article <ceb8ia$1n6@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>eldorado wrote:
>
>
>Something like this (untested):
>
>gawk 'BEGIN{RS="";FS="\n ";OFS=""}$1=$1'


Hey, that works. Very clever!

eldorado

2004-07-29, 3:55 pm

On Thu, 29 Jul 2004, Kenny McCormack wrote:

> In article <ceb8ia$1n6@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
> Hey, that works. Very clever!
>


Ed, it does work! Thanks.

If you have a moment would you explain what the $1=$1 line does?

--
Randomly generated signature --
The worst thing about censorship is [deleted by censorship bereau].

Ed Morton

2004-07-29, 3:55 pm



eldorado wrote:
> On Thu, 29 Jul 2004, Kenny McCormack wrote:
>
>
<snip>
>
>
> Ed, it does work! Thanks.
>
> If you have a moment would you explain what the $1=$1 line does?
>


It tells awk to re-evaluate the fields so the resulting $0 has what I
specified as an OFS (i.e. nothing) rather than whatever it originally
had as an FS (i.e. "\n ") and the fact I'm doing it as a condition
invokes the default action behavior of printing $0.

Ed.
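
The effect described above is easiest to see with single-line records. A small sketch, using the default RS and FS and an OFS of "-" chosen only for illustration: printing without touching a field leaves $0 as it was read, while assigning $1 to itself makes awk rebuild $0 from the fields joined by OFS.

```shell
# Without an assignment, $0 is printed exactly as it was read.
untouched=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print }')

# Assigning to any field rebuilds $0 from the fields, joined by OFS.
rebuilt=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')

printf '%s\n%s\n' "$untouched" "$rebuilt"
```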


Kenny McCormack

2004-07-29, 8:55 pm

In article <cebjft$6mc@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>eldorado wrote:
><snip>
>
>It tells awk to re-evaluate the fields so the resulting $0 has what I
>specified as an OFS (i.e. nothing) rather than whatever it originally
>had as an FS (i.e. "\n ") and the fact I'm doing it as a condition
>invokes the default action behavior of printing $0.


Yes, but there is something wrong here. First of all, purists will point
out that the "$1=$1" trick - using that as a cute shorthand for "re-scan
it and then print it" - fails, in the general case, if the $1 is a null
string (since null string evaluates to false). However, as an aside, in
real life, I often maintain that this is actually a feature, since it has
the effect of filtering blank lines out of your output (which is usually
[but, of course, not always] desirable).
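
The caveat is easy to check. In this sketch (the sample input is invented for illustration) the middle line is blank, so $1 is the null string there. Used as a bare pattern, $1=$1 drops that line, whereas doing the assignment in an action followed by the always-true pattern 1 keeps it.

```shell
# Bare pattern: the record prints only when the assigned value is true.
as_pattern=$(printf 'a  b\n\nc\n' | awk '$1 = $1')

# Assignment in an action, then the always-true pattern 1: every record prints.
as_action=$(printf 'a  b\n\nc\n' | awk '{ $1 = $1 } 1')

printf 'pattern form:\n%s\n\naction form:\n%s\n' "$as_pattern" "$as_action"
```

The pattern form also drops a record whose first field is numerically zero, which is the usual reason for preferring the action form when every record must survive.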

Anyway, back to the instant case, what do you think should be the output of
this command:

printf "\n foo" | gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'

I think that NF should be 2, and the length of the "foo field" should be 3,
but in my testing, NF always comes up 1 and the length always comes up 4.

Connecting the dots is left as an exercise for the reader...

Ed Morton

2004-07-29, 8:55 pm



Kenny McCormack wrote:
<snip>
> Anyway, back to the instant case, what do you think should be the output of
> this command:
>
> printf "\n foo" | gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'
>
> I think that NF should be 2, and the length of the "foo field" should be 3,
> but in my testing, NF always comes up 1 and the length always comes up 4.


Here's how I'd explain that:

RS takes precedence over FS, so in this case by setting RS="" we're
saying that a sequence of 1 or more blank lines is the record separator.
The "\n" at the start of the printf is therefore treated as a
sequence of 1 blank line and swallowed as a record separator. That just
leaves " foo", which is 1 field with size 4.
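
That reading can be checked by looking at what the single record actually contains. A sketch using Kenny's input; it prints NR, NF and the length of $0 rather than the field itself, so the leading space is not lost in the output.

```shell
# In paragraph mode (RS="") the leading newline is skipped, leaving the record " foo".
summary=$(printf '\n foo' | awk 'BEGIN { RS = ""; FS = "\n " } { print NR, NF, length($0) }')

# The three numbers are: record number, field count, record length.
printf '%s\n' "$summary"
```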

What I find odd though is that if I add a blank line to the end of the
input string to explicitly satisfy the RS:

printf "\n foo\n\n" |
gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'

Now the "foo" field IS number 2 and its length is 3, which I find
confusing since I thought the end of input (file) was supposed to get
treated the same as the end of a record, but that's not what's happening
in this case.

Ed.

Kenny McCormack

2004-07-30, 3:55 am

In article <cebqbu$9hk@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>Kenny McCormack wrote:
><snip>
>
>Here's how I'd explain that:
>
>RS takes precedence over FS so in this case by setting RS="" we're
>saying that a sequence of 1 or more blank lines is the record separator
>and so the "\n" at the start of the printf is being treated as a
>sequence of 1 blank line and so swallowed as record separator. That just
>leaves " foo" which is 1 field with size 4.


I can't say as I buy this. I think you are right that "RS takes
precedence over FS", but that statement could only have relevance if RS==FS
(i.e., if they are set to the same thing). Here, they are not; I leave it
as an exercise to prove that if they are equal, then every record has
either 0 or 1 field (cannot have more).

Further, having RS="" is, as far as I can tell, though I can find no
reference for this at the moment, equivalent to RS="\n\n+" - i.e., *2* or
more adjacent newlines. Since the string I am feeding to gawk above
contains only 1 newline character in the entire string, RS does not come
into play.

>What I find odd though is that if I add a blank line to the end of the
>input string to explicitly satisfy the RS:
>
>printf "\n foo\n\n" |
> gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'
>
>Now the "foo" field IS number 2 and its length is 3 which I find
>confusing since I thought the end of input (file) was supposed to get
>treated the same as the end of a record but that's not what's happening
>in this case.


I could not replicate this. And if you can, then it is clearly a bug.
(To make that more explicit, non-presence of an RS does not make an input
string invalid. It just means that NR is never greater than 1...)

Observe: printf "foo" | gawk '{print NR,NF,$1,length($1)}'

Ed Morton

2004-07-30, 3:55 am



Kenny McCormack wrote:
> In article <cebqbu$9hk@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>
> I can't say as I buy this. I think you are right that "RS takes
> precedence over FS", but that statement could only have relevance if RS==FS
> (i.e., if they are set to the same thing).


It's also relevant when one is a subset of the other. That's not quite
what's happening here, but it's close - the RS specification of "" is
sucking up the "\n" that you'd like to be part of the FS.

> Here, they are not; I leave it
> as an exercise to prove that if they are equal, then every record has
> either 0 or 1 field (cannot have more).
>
> Further, having RS="" is, as far as I can tell, though I can find no
> reference for this at the moment, equivalent to RS="\n\n+" - i.e., *2* or
> more adjacent newlines.


I agree in that 2 adjacent newlines constitute one blank line but try
setting RS to "\n\n+" and see if it produces the results you expect (I
do this below and explain why they aren't exactly equivalent).

> Since the string I am feeding to gawk above
> contains only 1 newline character in the entire string, RS does not come
> into play.


I disagree. printf "X" produces an X while printf "\nX" produces a blank
line followed by an X:

$ printf "X"
X$
$ printf "\nX"

X$

According to this from the GNU awk user's guide
(http://www.gnu.org/software/gawk/ma..._mono/gawk.html):

<By a special dispensation, an empty string as the value of RS indicates
that records are separated by *one* or more blank lines.>

so that single blank line should be treated as a record separator.

Having said that, the documentation DOES also go on to say the
contradictory:

<You can achieve the same effect as RS = "" by assigning the string
"\n\n+" to RS.>

so let's try that with a couple of versions of gawk:

gawk 3.0.4 on Solaris(ksh88):

$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;1; foo;
$ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;2;;foo

gawk 3.1.3 on Cygwin(bash):

$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;1; foo;
$ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;2;;fo

So, setting RS="" vs RS="\n\n+" produces different results on both
platforms. That is explained by this further text in the manual:

<There is an important difference between RS = "" and RS = "\n\n+". In
the first case, leading newlines in the input data file are ignored
.....In the second case, this special processing is not done.>

and that's why your initial newline is not being treated as you'd like.
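
The difference quoted from the manual shows up directly in the record count. A sketch with invented input that begins with a blank line: in paragraph mode the leading newlines are skipped, while with the regular expression the empty text before the first separator counts as a record of its own. Treating a multi-character RS as a regular expression is an extension; gawk supports it, and this sketch assumes the installed awk does too.

```shell
input='\n\nfoo\n\nbar'   # printf format string: blank line, foo, blank line, bar

# Paragraph mode: leading newlines are ignored, so the records are "foo" and "bar".
paragraph_count=$(printf "$input" | awk 'BEGIN { RS = "" } END { print NR }')

# Regex separator: the empty text before the first separator is a record too.
regex_count=$(printf "$input" | awk 'BEGIN { RS = "\n\n+" } END { print NR }')

printf 'paragraph mode: %s records, regex RS: %s records\n' "$paragraph_count" "$regex_count"
```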

I've no idea why gawk 3.1.3 chooses to truncate "foo" to "fo" in the
final example above!

>
>
>
> I could not replicate this. And if you can, then it is clearly a bug.


I couldn't replicate it on my Cygwin distribution with gawk 3.1.3 either,
but this is gawk 3.0.4 on either ksh88 or bash on Solaris (SunOS 5.8):

$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'
1 1 foo 4
$ printf "\n foo\n\n" |gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'
1 2 foo 3
$ gawk --version
GNU Awk 3.0.4

> (To make that more explicit, non-presence of an RS does not make an input
> string invalid. It just means that NR is never greater than 1...)
>


I agree that's the intent. It looks like there are bugs in both versions
of gawk I'm using 8-(.

Ed.

> Observe: printf "foo" | gawk '{print NR,NF,$1,length($1)}'
>


Stepan Kasal

2004-07-30, 8:55 am

Hello,

In article <MJWdnXESBq14QpTc4p2dnA@comcast.com>, Ed Morton wrote:
> $ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
> 1;2;;fo

....
> I've no idea why gawk 3.1.3 chooses to truncate "foo" to "fo" in the
> final example above!


That was a bug in gawk 3.1.3. A few days ago I verified that the
current beta, which Arnold has also announced here, has this bug fixed.

Have a nice day,
Stepan