Selecting blocks of data
| Hello,
I am new to awk and am having problems selecting groups of data.
I have a file in which each 'group' of data I'm interested in consists
of 5 lines.
If I don't get a complete group I want to ignore all the remaining
information associated with that partial group.
E.g. in this example file I have three groups: lines 1-5, lines 6-10
and lines 14-18.
Lines 11, 12 and 13 form an incomplete group and need to be removed,
such that I end up with a file of 15 lines.
Example File
Note: I have added line numbers to help describe my problem.
They do not appear in the actual file.
1 AAA
2 BBB
3 CCC
4 DDD
5 DDD
6 AAA
7 BBB
8 CCC
9 DDD
10 DDD
11 AAA
12 BBB
13 CCC
14 AAA
15 BBB
16 CCC
17 DDD
18 DDD
The incomplete groups can appear anywhere in my data file: at the
start, middle or end.
There can be 0 or more incomplete groups.
They are always missing the two lines with the repetitive string. (The
DDD lines.)
I don't know if that makes the task easier or harder!
They
All help appreciated!
| |
| Kenny McCormack 2004-11-16, 6:50 pm |
| In article <4a67a4e9.0411090831.22108b01@posting.google.com>,
mick <mick_merlin@hotmail.com> wrote:
>Hello,
>
>I am new to awk and am having problems with selecting groups of data.
>I have a file of the format with 5 lines of data consisting of a
>'group' of data i'm interested in.
>If I don't get a complete group I want to ignore all the remaining
>information associated with that partial group.
>i.e. in this example file I have three groups, lines 1-5, lines 6-10
>and lines 14-18,
>Lines 11, 12 and 13 are incomplete, and need to be removed, such that
>I end up with a file of 15 lines.
(With the standard caveat that the problem specification is unclear - one
has to intuit what is really going on)
One way to do it is to key on your "AAA" string and only output groups that
are (exactly) 5 lines long. Something like:
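/AAA/ {p()}          # a new group starts: flush the previous one
{x[++n]=$0}          # buffer the current line
END {p()}            # flush the final group
function p(   i) {   # print the buffered group only if it is complete
    if (n==5)
        for (i=1; i<=5; i++)
            print x[i]
    delete x         # whole-array delete: gawk and most modern awks
    n=0
}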
| |
| William Park 2004-11-16, 6:50 pm |
| mick <mick_merlin@hotmail.com> wrote:
> [ . . . ]
Since the presence of DDD is the key, invert the file and print 5 lines
from DDD, then invert the result back. Off the top of my head,
tac < file | awk '/DDD/,/AAA/' | tac
| |
| Robert Katz 2004-11-16, 6:50 pm |
| mick wrote:
> [ . . . ]
This should work regardless of where the missing strings are located.
There's extra stuff in there that can be trimmed out, but I wrote it
that way so that I can clean things up with a function. But I wasn't
able to write the function. So now I need help. How do I write a
function for the first three lines in each pattern? The function must
be able to tell me the value of both the array (a) and the index (x).
##########
#!/usr/bin/awk -f
x == 0 {
if (!/AAA/) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 1 {
if (!/BBB/) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 2 {
if (!/CCC/) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 3 {
if (!/DDD/) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 4 {
if (!/DDD/) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
##########
--
Regards,
---Robert
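On the function question above, one possible approach, offered as a sketch rather than a definitive answer: awk variables are global unless they are listed as function parameters, so a function can read and update both the array a and the index x directly. The names select_groups and step are made up for this illustration, and the match result is passed in as 0 or 1.

```shell
# Hypothetical wrapper so the awk program can be fed from a pipe or a file.
select_groups() {
  awk '
  # step(ok): ok is 1 when the current line is what state x expects.
  # a and x are globals, so the function sees and changes both.
  function step(ok) {
      if (!ok) x = /AAA/ ? 0 : -1    # restart on AAA, otherwise discard
      if (x >= 0) a[x++] = $0
      else x = 0
  }
  x == 0 { step($0 ~ /AAA/); next }
  x == 1 { step($0 ~ /BBB/); next }
  x == 2 { step($0 ~ /CCC/); next }
  x == 3 { step($0 ~ /DDD/); next }
  x == 4 { step($0 ~ /DDD/)
           if (x == 5) { for (i = 0; i < x; i++) print a[i]; x = 0 } }
  ' "$@"
}

# An incomplete group followed by a complete one: only the latter is printed.
printf '%s\n' AAA BBB CCC AAA BBB CCC DDD DDD | select_groups
```

Scalars are passed to awk functions by value, which is why x is left as a global here instead of being a parameter.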
| |
| Robert Katz 2004-11-16, 6:50 pm |
| Robert Katz wrote:
[ . . . ]
> There's extra stuff in there that can be trimmed out, but I wrote it
> that way so that I can clean things up with a function. But I wasn't
> able to write the function. So now I need help. How do I write a
> function for the first three lines in each pattern? The function must
> be able to tell me the value of both the array (a) and the index (x).
>
> ##########
> #!/usr/bin/awk -f
> x == 0 {
> if (!/AAA/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 1 {
> if (!/BBB/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 2 {
> if (!/CCC/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 3 {
> if (!/DDD/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 4 {
> if (!/DDD/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
> ##########
>
Okay, I rewrote it so that there are three identical lines of action for
each of the four patterns. But I still couldn't write the function, let
alone figure out how to call it.
#!/usr/bin/awk -f
x == 0 {
pattern = /AAA/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 1 {
pattern = /BBB/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 2 {
pattern = /CCC/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
next
}
x == 3 || x == 4 {
pattern = /DDD/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
--
Regards,
---Robert
| |
| Robert Katz 2004-11-16, 6:50 pm |
| mick wrote:
> [ . . . ]
Forget all the function stuff that other guy was jabbering about, this
ought to do what you want regardless of which lines are missing.
#!/usr/bin/awk -f
{
if (x == 0) pattern = /AAA/
else if (x == 1) pattern = /BBB/
else if (x == 2) pattern = /CCC/
else if (x == 3 || x == 4) pattern = /DDD/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
--
Regards,
---Robert
| |
| Ed Morton 2004-11-16, 6:50 pm |
|
Robert Katz wrote:
> mick wrote:
>
>
>
> Forget all the function stuff that other guy was jabbering about, this
> ought to do what you want regardless of which lines are missing.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else if (x == 3 || x == 4) pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
>
Or alternatively just set the appropriate RS and FS then print out the 3
lines before each RS followed by the RS, e.g.:
gawk -vRS="DDD\nDDD\n" -vFS="\n" '
{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Regards,
Ed.
| |
| Robert Katz 2004-11-16, 6:50 pm |
| Robert Katz wrote:
> mick wrote:
>
>
>
> Forget all the function stuff that other guy was jabbering about, this
> ought to do what you want regardless of which lines are missing.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else if (x == 3 || x == 4) pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
>
Okay, just a bit simpler.
#!/usr/bin/awk -f
{
if (x == 0) pattern = /AAA/
else if (x == 1) pattern = /BBB/
else if (x == 2) pattern = /CCC/
else pattern = /DDD/
if (!pattern) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
if (x == 5) {for (i = 0; i < x; i++) print a[i]}
}
--
Regards,
---Robert
| |
| Ed Morton 2004-11-16, 6:50 pm |
|
Ed Morton wrote:
<snip>
> Or alternatively just set the appropriate RS and FS then print out the 3
> lines before each RS followed by the RS, e.g.:
>
> gawk -vRS="DDD\nDDD\n" -vFS="\n" '
> {printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Just occurred to me that'll fail if the input file doesn't end in
DDD\nDDD\n, so you need a small tweak. You just need to check that RT
got set so this'll work:
gawk -vRS="DDD\nDDD\n" -vFS="\n" '
RT{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Regards,
Ed.
| |
| Robert Katz 2004-11-16, 6:50 pm |
| Robert Katz wrote:
> Robert Katz wrote:
>
>
> Okay, just a bit simpler.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]}
> }
>
And simpler still. I changed the variable to make it clearer that valid
is just a boolean with values of 0 or 1.
#!/usr/bin/awk -f
{
if (x == 0) valid = /AAA/
else if (x == 1) valid = /BBB/
else if (x == 2) valid = /CCC/
else valid = /DDD/
if (!valid) x = 0
a[x++] = $0
if (x == 5) for (i = 0; i < x; i++) print a[i]
}
--
Regards,
---Robert
| |
| Robert Katz 2004-11-16, 6:50 pm |
| Robert Katz wrote:
> Robert Katz wrote:
>
>
> And simpler still. I changed the variable to make it clearer that valid
> is just a boolean with values of 0 or 1.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) valid = /AAA/
> else if (x == 1) valid = /BBB/
> else if (x == 2) valid = /CCC/
> else valid = /DDD/
> if (!valid) x = 0
> a[x++] = $0
> if (x == 5) for (i = 0; i < x; i++) print a[i]
> }
>
Trying to minimize characters is sometimes dangerous. The last solution
is definitely wrong (I found a counterexample). The one before is
suspect. So let me put back one that is definitely right ;-)
#!/usr/bin/awk -f
BEGIN {x = 0}
{
if (x == 0) valid = /AAA/
else if (x == 1) valid = /BBB/
else if (x == 2) valid = /CCC/
else valid = /DDD/
if (!valid) x = /AAA/ ? 0 : -1
if (x >= 0) a[x++] = $0
else x = 0
if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
--
Regards,
---Robert
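A sketch of the same state machine with the expected patterns held in an array, which may make the reset logic easier to see. The names strict_groups and want are made up for this illustration. One input that seems to trip up the shortened version is BBB BBB CCC DDD DDD: without the -1 reset the first BBB is stored in slot 0 as if it were AAA, and the five lines are then printed as a complete group.

```shell
# Hypothetical wrapper around a table-driven version of the state machine.
strict_groups() {
  awk '
  BEGIN { n = split("AAA BBB CCC DDD DDD", want, " "); x = 0 }
  {
      if ($0 !~ want[x + 1])             # line does not fit the current slot
          x = ($0 ~ want[1]) ? 0 : -1    # restart on AAA, else drop the line
      if (x >= 0) a[x++] = $0
      else x = 0
      if (x == n) { for (i = 0; i < n; i++) print a[i]; x = 0 }
  }
  ' "$@"
}

# The problem input described above: nothing should be printed for it.
printf '%s\n' BBB BBB CCC DDD DDD | strict_groups
```

Resetting x to 0 after printing matters as well, since otherwise x keeps counting past 5 and the next group starts in the wrong state.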
| |
| Robert Katz 2004-12-19, 3:55 am |
| mick wrote:
> [ . . . ]
Since none of the previous solutions looks entirely right, here's
another suggestion.
#!/usr/local/bin/gawk -f
N < 5 {N++}
N == 5 &&
(XXXX ~ /AAA/) &&
(XXX ~ /BBB/) &&
(XX ~ /CCC/) &&
(X ~ /DDD/) &&
/DDD/ {
print XXXX "\n" XXX "\n" XX "\n" X "\n" $0
N = 0
}
{XXXX = XXX; XXX = XX; XX = X; X = $0}
--
Regards,
---Robert
| |
| Ed Morton 2004-12-19, 3:55 pm |
|
Robert Katz wrote:
> mick wrote:
>
>
>
> Since none of the previous solutions looks entirely right, here's
> another suggestion.
>
> #!/usr/local/bin/gawk -f
> N < 5 {N++}
> N == 5 &&
> (XXXX ~ /AAA/) &&
> (XXX ~ /BBB/) &&
> (XX ~ /CCC/) &&
> (X ~ /DDD/) &&
> /DDD/ {
> print XXXX "\n" XXX "\n" XX "\n" X "\n" $0
> N = 0
> }
> {XXXX = XXX; XXX = XX; XX = X; X = $0}
>
I can't come up with a case that the solution I posted:
gawk -vRS="DDD\nDDD\n" -vFS="\n" '
RT{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
wouldn't work for, unless "DDD" is actually a regexp in which case I'd
need a similar solution printing "RT" instead of "RS" but I'm not going
to spend time thinking about that unless the OP says that is the case.
I think the problem with what you posted is the assumption that there's
always some specific pattern "AAA", etc. in the input. When I read the
OP's posting, I assumed he was just using AAA, BBB, and CCC to indicate
that there were 3 lines with some text different from DDD. All we know
for SURE from what the OP posted (because he explicitly states it) is
that each record ends with 2 "DDD" lines and that the only thing that
can be missing is that record terminator. At the end of the day, the OP
was just too vague so we're guessing....
Regards,
Ed.
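For awks without gawk's RT, a rough sketch of the same reading of the problem, where only the two terminating DDD lines are assumed to be known. It keeps the last five lines in a small ring buffer and prints them when the current and previous lines are both exactly "DDD". The names ddd_groups, buf and since are made up for this illustration, and treating the terminator as a line that equals DDD (as opposed to one that contains it) is an assumption.

```shell
# Hypothetical wrapper: prints the 3 lines before each DDD/DDD pair plus the pair.
ddd_groups() {
  awk '
  { buf[NR % 5] = $0; since++ }          # remember the last 5 lines
  since >= 5 && $0 == "DDD" && buf[(NR - 1) % 5] == "DDD" {
      for (k = NR - 4; k <= NR; k++)     # oldest to newest
          print buf[k % 5]
      since = 0                          # lines already printed are not reused
  }
  ' "$@"
}

# The example data from the original post, without the line numbers.
printf '%s\n' AAA BBB CCC DDD DDD AAA BBB CCC DDD DDD \
              AAA BBB CCC AAA BBB CCC DDD DDD | ddd_groups
```

Like the RS-based version, this does not check what the three leading lines contain, so it is only as good as the guess about the input format.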
| |
| William James 2004-12-19, 8:55 pm |
| # A good block consists of 5 lines ending with
# 2 identical lines. Discard bad blocks.
NF \
{ block[++count] = $0
if ( count > 4 && block[count]==block[count-1] )
{ for ( i=count-4; i<=count; i++ )
print block[i]
count = 0
}
}
| |
| Ed Morton 2004-12-19, 8:55 pm |
|
William James wrote:
> # A good block consists of 5 lines ending with
> # 2 identical lines. Discard bad blocks.
>
The OP's definition is "A good block consists of 5 lines ending with 2
lines containing (or consisting entirely of) the pattern DDD".
So, if you get 2 identical lines containing "bob", that's not the end of
a block. Whether the block end contains DDD or is DDD is open to
interpretation.
Ed.
> NF \
> { block[++count] = $0
> if ( count > 4 && block[count]==block[count-1] )
> { for ( i=count-4; i<=count; i++ )
> print block[i]
> count = 0
> }
> }
>
| |
| Robert Katz 2004-12-22, 3:55 am |
| Ed Morton wrote:
>
>
> Robert Katz wrote:
>
>
> I can't come up with a case that the solution I posted:
>
> gawk -vRS="DDD\nDDD\n" -vFS="\n" '
> RT{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
>
> wouldn't work for, unless "DDD" is actually a regexp in which case I'd
> need a similar solution printing "RT" instead of "RS" but I'm not going
> to spend time thinking about that unless the OP says that is the case.
It doesn't work for the specific data given in the example. Just try it.
[ . . . ]
--
Regards,
---Robert
| |
| Ed Morton 2004-12-22, 3:57 pm |
|
Robert Katz wrote:
> Ed Morton wrote:
<snip>
>
>
> It doesn't work for the specific data given in the example. Just try it.
Works for me:
PS1> cat infile
AAA
BBB
CCC
DDD
DDD
AAA
BBB
CCC
DDD
DDD
AAA
BBB
CCC
AAA
BBB
CCC
DDD
DDD
PS1> gawk -vRS="DDD\nDDD\n" -vFS="\n" 'RT{printf"%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}' infile
AAA
BBB
CCC
DDD
DDD
AAA
BBB
CCC
DDD
DDD
AAA
BBB
CCC
DDD
DDD
PS1> gawk --version
GNU Awk 3.0.4
.....
Regards,
Ed.
> [ . . . ]
>
| |
| Robert Katz 2004-12-22, 8:55 pm |
| Ed Morton wrote:
>
>
> Robert Katz wrote:
>
>
> <snip>
>
>
>
> Works for me:
>
> PS1> cat infile
> AAA
> BBB
> CCC
> DDD
> DDD
> AAA
> BBB
> CCC
> DDD
> DDD
> AAA
> BBB
> CCC
> AAA
> BBB
> CCC
> DDD
> DDD
Okay, I was using
1 AAA
2 BBB
3 CCC
4 DDD
5 DDD
6 AAA
7 BBB
8 CCC
9 DDD
10 DDD
11 AAA
12 BBB
13 CCC
14 AAA
15 BBB
16 CCC
17 DDD
18 DDD
[ . . . ]
--
Regards,
---Robert
| |