Home > Archive > AWK > July 2004 > Joining lines
| eldorado 2004-07-29, 3:55 pm |
| Is there an awk one-liner that would join two lines only if the second line
has a space at the beginning? E.g.:
1234
 56
789
9876
1
would look like
123456
789
9876
1
--
Randomly generated signature --
We are coming after you. God may have mercy on you, but we won't -- John McCain
| |
| Ed Morton 2004-07-29, 3:55 pm |
|
eldorado wrote:
> Is there an awk one-liner that would join two lines only if the second line
> has a space at the beginning? E.g.:
>
> 1234
>  56
> 789
> 9876
> 1
>
> would look like
> 123456
> 789
> 9876
> 1
>
>
Something like this (untested):
gawk 'BEGIN{RS="";FS="\n ";OFS=""}$1=$1'
Regards,
Ed.
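For readers without gawk, the same join can also be done one line at a time. This is a hedged sketch, not Ed's method: the function name join_continuations is made up for illustration, and it assumes a continuation line starts with exactly one space that should be dropped, as in the sample input.

```sh
# Hypothetical helper: join any line that starts with a space onto the previous line.
join_continuations() {
    awk '
        NR > 1 && !/^ / { print "" }        # a new logical line starts, so end the previous one
        { sub(/^ /, ""); printf "%s", $0 }  # drop one leading space, emit without a newline
        END { if (NR) print "" }            # terminate the final line, if there was any input
    '
}

printf '1234\n 56\n789\n9876\n1\n' | join_continuations
# 123456
# 789
# 9876
# 1
```

Unlike the RS="" approach, this does not read the whole input as one record, and it does not depend on how a given awk splits fields in paragraph mode.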
| |
| Kenny McCormack 2004-07-29, 3:55 pm |
| In article <ceb8ia$1n6@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>eldorado wrote:
>
>
>Something like this (untested):
>
>gawk 'BEGIN{RS="";FS="\n ";OFS=""}$1=$1'
Hey, that works. Very clever!
| |
| eldorado 2004-07-29, 3:55 pm |
| On Thu, 29 Jul 2004, Kenny McCormack wrote:
> In article <ceb8ia$1n6@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
> Hey, that works. Very clever!
>
Ed, it does work! Thanks.
If you have a moment would you explain what the $1=$1 line does?
--
Randomly generated signature --
The worst thing about censorship is [deleted by censorship bureau].
| |
| Ed Morton 2004-07-29, 3:55 pm |
|
eldorado wrote:
> On Thu, 29 Jul 2004, Kenny McCormack wrote:
>
>
<snip>
>
>
> Ed, it does work! Thanks.
>
> If you have a moment would you explain what the $1=$1 line does?
>
It tells awk to rebuild the record from its fields, so the resulting $0
has what I specified as an OFS (i.e. nothing) in place of whatever it
originally had as an FS (i.e. "\n "). The fact that I'm doing it as a
condition invokes the default action of printing $0.
Ed.
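The rebuild effect described above can be seen in isolation with ordinary space-separated input. This is a small sketch for illustration only; the function name rebuild_with_ofs is an assumption, not something from the thread.

```sh
# Hypothetical helper: assign $1 to itself so awk rebuilds $0 using the given OFS.
rebuild_with_ofs() {
    awk -v OFS="$1" '{ $1 = $1; print }'
}

# Without an assignment to a field, $0 is printed exactly as read.
printf 'a b  c\n' | awk -v OFS='-' '{ print }'
# a b  c

# After $1 = $1, the fields are rejoined with OFS.
printf 'a b  c\n' | rebuild_with_ofs '-'
# a-b-c
```

In Ed's one-liner the same thing happens with OFS set to the empty string, so the "\n " that separated the fields simply disappears.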
| |
| Kenny McCormack 2004-07-29, 8:55 pm |
| In article <cebjft$6mc@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>eldorado wrote:
><snip>
>
>It tells awk to re-evaluate the fields so the resulting $0 has what I
>specified as an OFS (i.e. nothing) rather than whatever it originally
>had as an FS (i.e. "\n ") and the fact I'm doing it as a condition
>invokes the default action behavior of printing $0.
Yes, but there is something wrong here. First of all, purists will point
out that the "$1=$1" trick - using that as a cute shorthand for "re-scan
it and then print it" - fails, in the general case, if the $1 is a null
string (since null string evaluates to false). However, as an aside, in
real life, I often maintain that this is actually a feature, since it has
the effect of filtering blank lines out of your output (which is usually
[but, of course, not always] desirable).
Anyway, back to the instant case, what do you think should be the output of
this command:
printf "\n foo" | gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'
I think that NF should be 2, and the length of the "foo field" should be 3,
but in my testing, NF always comes up 1 and the length always comes up 4.
Connecting the dots is left as an exercise for the reader...
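The null-string caveat about using $1=$1 as a pattern can be reproduced with a blank line in the input. A minimal sketch; the function names as_pattern and as_action are made up for illustration.

```sh
# $1 = $1 used as the pattern: the record prints only if the assigned value is true.
as_pattern() {
    awk '$1 = $1'
}

# $1 = $1 used as an action, followed by an always-true pattern that prints.
as_action() {
    awk '{ $1 = $1 } 1'
}

printf 'a b\n\nc d\n' | as_pattern
# a b
# c d

printf 'a b\n\nc d\n' | as_action
# a b
#
# c d
```

The first form silently drops the blank line because an empty $1 evaluates to false. The second form prints every record, which is the safer choice when that filtering is not wanted.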
| |
| Ed Morton 2004-07-29, 8:55 pm |
|
Kenny McCormack wrote:
<snip>
> Anyway, back to the instant case, what do you think should be the output of
> this command:
>
> printf "\n foo" | gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'
>
> I think that NF should be 2, and the length of the "foo field" should be 3,
> but in my testing, NF always comes up 1 and the length always comes up 4.
Here's how I'd explain that:
RS takes precedence over FS so in this case by setting RS="" we're
saying that a sequence of 1 or more blank lines is the record separator
and so the "\n" at the start of the printf is being treated as a
sequence of 1 blank line and so swallowed as record separator. That just
leaves " foo" which is 1 field with size 4.
What I find odd though is that if I add a blank line to the end of the
input string to explicitly satisfy the RS:
printf "\n foo\n\n" |
gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'
Now the "foo" field IS number 2 and its length is 3, which I find
confusing: I thought the end of input (file) was supposed to be treated
the same as the end of a record, but that's not what's happening in
this case.
Ed.
| |
| Kenny McCormack 2004-07-30, 3:55 am |
| In article <cebqbu$9hk@netnews.proxy.lucent.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>Kenny McCormack wrote:
><snip>
>
>Here's how I'd explain that:
>
>RS takes precedence over FS so in this case by setting RS="" we're
>saying that a sequence of 1 or more blank lines is the record separator
>and so the "\n" at the start of the printf is being treated as a
>sequence of 1 blank line and so swallowed as record separator. That just
>leaves " foo" which is 1 field with size 4.
I can't say as I buy this. I think you are right that "RS takes
precedence over FS", but that statement could only have relevance if RS==FS
(i.e., if they are set to the same thing). Here, they are not; I leave it
as an exercise to prove that if they are equal, then every record has
either 0 or 1 field (cannot have more).
Further, having RS="" is, as far as I can tell, though I can find no
reference for this at the moment, equivalent to RS="\n\n+" - i.e., *2* or
more adjacent newlines. Since the string I am feeding to gawk above
contains only 1 newline character in the entire string, RS does not come
into play.
>What I find odd though is that if I add a blank line to the end of the
>input string to explicitly satisfy the RS:
>
>printf "\n foo\n\n" |
> gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'
>
>Now the "foo" field IS number 2 and its length is 3 which I find
>confusing since I thought the end of input (file) was supposed to get
>treated the same as the end of a record but that's not what's happening
>in this case.
I could not replicate this. And if you can, then it is clearly a bug.
(To make that more explicit, non-presence of an RS does not make an input
string invalid. It just means that NR is never greater than 1...)
Observe: printf "foo" | gawk '{print NR,NF,$1,length($1)}'
| |
| Ed Morton 2004-07-30, 3:55 am |
|
Kenny McCormack wrote:
> In article <cebqbu$9hk@netnews.proxy.lucent.com>,
> Ed Morton <morton@lsupcaemnt.com> wrote:
>
>
>
> I can't say as I buy this. I think you are right that "RS takes
> precedence over FS", but that statement could only have relevance if RS==FS
> (i.e., if they are set to the same thing).
It's also relevant when one is a subset of the other. That's not quite
what's happening here, but it's close - the RS specification of "" is
sucking up the "\n" that you'd like to be part of the FS.
> Here, they are not; I leave it
> as an exercise to prove that if they are equal, then every record has
> either 0 or 1 field (cannot have more).
>
> Further, having RS="" is, as far as I can tell, though I can find no
> reference for this at the moment, equivalent to RS="\n\n+" - i.e., *2* or
> more adjacent newlines.
I agree in that 2 adjacent newlines constitute one blank line but try
setting RS to "\n\n+" and see if it produces the results you expect (I
do this below and explain why they aren't exactly equivalent).
> Since the string I am feeding to gawk above
> contains only 1 newline character in the entire string, RS does not come
> into play.
I disagree. printf "X" produces an X while printf "\nX" produces a blank
line followed by an X:
$ printf "X"
X$
$ printf "\nX"

X$
According to this from the GNU awk user's guide
(http://www.gnu.org/software/gawk/ma..._mono/gawk.html):
<By a special dispensation, an empty string as the value of RS indicates
that records are separated by *one* or more blank lines.>
so that single blank line should be treated as a record separator.
Having said that, the documentation DOES also go on to say something
that seems contradictory:
<You can achieve the same effect as RS = "" by assigning the string
"\n\n+" to RS.>
so let's try that with a couple of versions of gawk:
gawk 3.0.4 on Solaris(ksh88):
$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;1; foo;
$ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;2;;foo
gawk 3.1.3 on Cygwin(bash):
$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;1; foo;
$ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
1;2;;fo
So, setting RS="" vs RS="\n\n+" produces different results on both
platforms. That is explained by this further text in the manual:
<There is an important difference between RS = "" and RS = "\n\n+". In
the first case, leading newlines in the input data file are ignored
.....In the second case, this special processing is not done.>
and that's why your initial newline is not being treated as you'd like.
I've no idea why gawk 3.1.3 chooses to truncate "foo" to "fo" in the
final example above!
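The difference quoted from the manual can be checked without involving FS at all, by printing only the record number and the record length. This is a sketch under assumptions: the function name show_records is invented here, and the regex form of RS is an extension that not every awk accepts, so only the gawk result is given for that case.

```sh
# Hypothetical helper: print "NR:length" for each record read with the given RS.
show_records() {
    awk -v RS="$1" '{ printf "%d:%d\n", NR, length($0) }'
}

# Paragraph mode: the leading newline is ignored, so the record is " foo".
printf '\n foo' | show_records ''
# 1:4

# Leading blank lines do not create an empty first record either.
printf '\n\n foo\n\nbar\n' | show_records ''
# 1:4
# 2:3

# Regex RS: no special handling, so the newline stays in the record.
# With gawk this prints 1:5; other awks may treat a multi-character RS differently.
printf '\n foo' | show_records '\n\n+'
```

With the newline still inside the record, FS="\n " has something to match, which is consistent with NF being 2 in the RS="\n\n+" runs above and 1 in the RS="" runs.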
>
>
>
> I could not replicate this. And if you can, then it is clearly a bug.
I couldn't replicate it on my Cygwin distribution with gawk 3.1.3 either,
but this is gawk 3.0.4 on either ksh88 or bash on Solaris (SunOS 5.8):
$ printf "\n foo" |gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$1,length($1)}'
1 1 foo 4
$ printf "\n foo\n\n" |gawk 'BEGIN{RS="";FS="\n "}{print NR,NF,$2,length($2)}'
1 2 foo 3
$ gawk --version
GNU Awk 3.0.4
> (To make that more explicit, non-presence of an RS does not make an input
> string invalid. It just means that NR is never greater than 1...)
>
I agree that's the intent. It looks like there are bugs in both versions
of gawk I'm using 8-(.
Ed.
> Observe: printf "foo" | gawk '{print NR,NF,$1,length($1)}'
>
| |
| Stepan Kasal 2004-07-30, 8:55 am |
| Hello,
In article <MJWdnXESBq14QpTc4p2dnA@comcast.com>, Ed Morton wrote:
> $ printf "\n foo" |gawk 'BEGIN{RS="\n\n+";FS="\n ";OFS=";"}{print NR,NF,$1,$2}'
> 1;2;;fo
....
> I've no idea why gawk 3.1.3 chooses to truncate "foo" to "fo" in the
> final example above!
That was a bug in gawk 3.1.3. A few days ago I verified that the
current beta, which Arnold also announced here, has this bug fixed.
Have a nice day,
Stepan