Code Comments
Hello,

I am new to awk and am having problems selecting groups of data. My file is made up of 'groups' of 5 lines each, and I am only interested in complete groups. If a group is incomplete, I want to ignore all the remaining lines that belong to that partial group.

In the example file below there are three complete groups: lines 1-5, lines 6-10 and lines 14-18. Lines 11, 12 and 13 form an incomplete group and need to be removed, so that I end up with a file of 15 lines.

Example File
Note: I have added line numbers to help describe my problem. They do not appear in the actual file.

1 AAA
2 BBB
3 CCC
4 DDD
5 DDD
6 AAA
7 BBB
8 CCC
9 DDD
10 DDD
11 AAA
12 BBB
13 CCC
14 AAA
15 BBB
16 CCC
17 DDD
18 DDD

The incomplete groups can appear anywhere in my data file: at the start, middle or end. There can be 0 or more incomplete groups. They are always missing the two lines with the repetitive string (the DDD lines). I don't know if that makes the task easier or harder!

All help appreciated!
In article <4a67a4e9.0411090831.22108b01@posting.google.com>,
mick <mick_merlin@hotmail.com> wrote:
>Hello,
>
>I am new to awk and am having problems with selecting groups of data.
>I have a file of the format with 5 lines of data consisting of a
>'group' of data i'm interested in.
>If I don't get a complete group I want to ignore all the remaining
>information associated with that partial group.
>i.e. in this example file I have three groups, lines 1-5, lines 6,10
>and lines 14-18,
>Lines 11, 12 and 13 are incomplete, and need to be removed, such that
>I end up with a file of 15 lines.
(With the standard caveat that the problem specification is unclear - one
has to intuit what is really going on)
One way to do it is to key on your "AAA" string and only output groups that
are (exactly) 5 lines long. Something like:
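/AAA/ {p()}                 # a new group starts: flush the previous one
{x[++n]=$0}                 # buffer the current line
END {p()}                   # flush the last group
function p(   i) {
    if (n==5)               # print only complete, 5-line groups
        for (i=1; i<=5; i++)
            print x[i]
    delete x
    n=0
}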
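To try the buffering approach on the sample data from the original post, here is a minimal sketch wrapped in a shell function. The names complete_groups and emit_group are made up for illustration, and it assumes every group, complete or not, starts with a line matching AAA.

```shell
# Keep only groups that are exactly 5 lines long; a group starts at AAA.
complete_groups() {
    awk '
        /AAA/ { emit_group() }     # new group: print or drop the previous one
              { buf[++n] = $0 }    # buffer the current line
        END   { emit_group() }     # handle the final group
        function emit_group(   i) {
            if (n == 5)
                for (i = 1; i <= 5; i++)
                    print buf[i]
            split("", buf)         # portable way to empty the array
            n = 0
        }
    '
}

# Sample input from the original post, without the line numbers.
printf '%s\n' AAA BBB CCC DDD DDD AAA BBB CCC DDD DDD \
              AAA BBB CCC AAA BBB CCC DDD DDD | complete_groups
```

On that input the three-line group is dropped and the three complete groups come through unchanged.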
mick <mick_merlin@hotmail.com> wrote:
> [ . . . ]
> The incomplete groups can appear anywhere in my data file. At the
> start, middle or end.
> There can be 0 or more incomplete groups.
> They are always missing the two lines with the repetitive string. (The
> DDD lines.)

Since the presence of DDD is the key, reverse the file and print from each DDD through the next AAA, then reverse the result again to restore the original order. Off the top of my head:
tac < file | awk '/DDD/,/AAA/' | tac
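The reverse-and-range idea can be sketched as follows. In the reversed stream every complete group begins with a DDD line and ends at its AAA line, so the range pattern /DDD/,/AAA/ selects complete groups and skips the CCC, BBB, AAA remains of incomplete ones. This assumes tac is available (it is part of GNU coreutils) and that incomplete groups are only ever missing their DDD lines; the function name keep_complete is hypothetical.

```shell
# Reverse the input, keep each DDD..AAA range, then restore the order.
keep_complete() {
    tac | awk '/DDD/,/AAA/' | tac
}

# Complete group, incomplete group, complete group.
printf '%s\n' AAA BBB CCC DDD DDD AAA BBB CCC AAA BBB CCC DDD DDD | keep_complete
```

Without the second tac the groups would come out bottom-up, each one reversed.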
mick wrote:
> Hello,
>
> I am new to awk and am having problems with selecting groups of data.
> I have a file of the format with 5 lines of data consisting of a
> 'group' of data i'm interested in.
> If I don't get a complete group I want to ignore all the remaining
> information associated with that partial group.
> i.e. in this example file I have three groups, lines 1-5, lines 6,10
> and lines 14-18,
> Lines 11, 12 and 13 are incomplete, and need to be removed, such that
> I end up with a file of 15 lines.
>
> Example File
> Note: I have added line numbers to help describe my problem.
> They do not appear in the actual file.
>
> 1 AAA
> 2 BBB
> 3 CCC
> 4 DDD
> 5 DDD
> 6 AAA
> 7 BBB
> 8 CCC
> 9 DDD
> 10 DDD
> 11 AAA
> 12 BBB
> 13 CCC
> 14 AAA
> 15 BBB
> 16 CCC
> 17 DDD
> 18 DDD
>
>
> The incomplete groups can appear anywhere in my data file. At the
> start, middle or end.
> There can be 0 or more incomplete groups.
> They are always missing the two lines with the repetitive string. (The
> DDD lines.)
> I don't know if that makes the task easier or harder!
> They
This should work regardless of where the missing strings are located.
There's extra stuff in there that can be trimmed out, but I wrote it
that way so that I can clean things up with a function. But I wasn't
able to write the function. So now I need help. How do I write a
function for the first three lines in each pattern? The function must
be able to tell me the value of both the array (a) and the index (x).
##########
#!/usr/bin/awk -f
x == 0 {
    if (!/AAA/) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 1 {
    if (!/BBB/) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 2 {
    if (!/CCC/) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 3 {
    if (!/DDD/) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 4 {
    if (!/DDD/) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
##########
--
Regards,
---Robert
Robert Katz wrote:
[ . . . ]
> There's extra stuff in there that can be trimmed out, but I wrote it
> that way so that I can clean things up with a function. But I wasn't
> able to write the function. So now I need help. How do I write a
> function for the first three lines in each pattern? The function must
> be able to tell me the value of both the array (a) and the index (x).
>
> ##########
> #!/usr/bin/awk -f
> x == 0 {
> if (!/AAA/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 1 {
> if (!/BBB/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 2 {
> if (!/CCC/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 3 {
> if (!/DDD/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> next
> }
> x == 4 {
> if (!/DDD/) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
> ##########
>
Okay, I rewrote it so that there are three identical lines of action for
each of the four patterns. But I still couldn't write the function, let
alone figure out how to call it.
#!/usr/bin/awk -f
x == 0 {
    pattern = /AAA/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 1 {
    pattern = /BBB/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 2 {
    pattern = /CCC/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    next
}
x == 3 || x == 4 {
    pattern = /DDD/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
--
Regards,
---Robert
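One possible answer to the function question, as a sketch rather than anything posted in the thread: awk passes arrays to functions by reference and scalars by value, so a function can store into the array directly and hand the updated index back as its return value. The names step and filter_groups are made up for illustration.

```shell
filter_groups() {
    awk '
        # step(ok, x, a): ok is 1 when the line matches what position x expects.
        # The array a is shared with the caller; the new index is returned.
        function step(ok, x, a) {
            if (!ok) x = ($0 ~ /AAA/) ? 0 : -1   # restart at AAA, else discard
            if (x >= 0) a[x++] = $0
            else x = 0
            return x
        }
        BEGIN  { x = 0 }
        x == 0 { x = step($0 ~ /AAA/, x, a); next }
        x == 1 { x = step($0 ~ /BBB/, x, a); next }
        x == 2 { x = step($0 ~ /CCC/, x, a); next }
        x == 3 || x == 4 {
            x = step($0 ~ /DDD/, x, a)
            if (x == 5) { for (i = 0; i < x; i++) print a[i]; x = 0 }
        }
    '
}

# Sample input from the original post, without the line numbers.
printf '%s\n' AAA BBB CCC DDD DDD AAA BBB CCC DDD DDD \
              AAA BBB CCC AAA BBB CCC DDD DDD | filter_groups
```

Each call site becomes a single line of the form x = step(test, x, a), which is the duplication the three repeated lines were hiding.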
mick wrote:
> Hello,
>
> I am new to awk and am having problems with selecting groups of data.
> I have a file of the format with 5 lines of data consisting of a
> 'group' of data i'm interested in.
> If I don't get a complete group I want to ignore all the remaining
> information associated with that partial group.
> i.e. in this example file I have three groups, lines 1-5, lines 6,10
> and lines 14-18,
> Lines 11, 12 and 13 are incomplete, and need to be removed, such that
> I end up with a file of 15 lines.
>
> Example File
> Note: I have added line numbers to help describe my problem.
> They do not appear in the actual file.
>
> 1 AAA
> 2 BBB
> 3 CCC
> 4 DDD
> 5 DDD
> 6 AAA
> 7 BBB
> 8 CCC
> 9 DDD
> 10 DDD
> 11 AAA
> 12 BBB
> 13 CCC
> 14 AAA
> 15 BBB
> 16 CCC
> 17 DDD
> 18 DDD
>
>
> The incomplete groups can appear anywhere in my data file. At the
> start, middle or end.
> There can be 0 or more incomplete groups.
> They are always missing the two lines with the repetitive string. (The
> DDD lines.)
> I don't know if that makes the task easier or harder!
> They
>
>
> All help appreciated!
Forget all the function stuff that other guy was jabbering about, this
ought to do what you want regardless of which lines are missing.
#!/usr/bin/awk -f
{
    if (x == 0) pattern = /AAA/
    else if (x == 1) pattern = /BBB/
    else if (x == 2) pattern = /CCC/
    else if (x == 3 || x == 4) pattern = /DDD/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    else x = 0
    if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
}
--
Regards,
---Robert
Robert Katz wrote:
> mick wrote:
>
>
>
> Forget all the function stuff that other guy was jabbering about, this
> ought to do what you want regardless of which lines are missing.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else if (x == 3 || x == 4) pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
>
Or alternatively just set the appropriate RS and FS, then print out the 3
lines before each RS followed by the RS, e.g.:
gawk -vRS="DDD\nDDD\n" -vFS="\n" '
{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Regards,
Ed.
Robert Katz wrote:
> mick wrote:
>
>
>
> Forget all the function stuff that other guy was jabbering about, this
> ought to do what you want regardless of which lines are missing.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else if (x == 3 || x == 4) pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> else x = 0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]; x = 0}
> }
>
Okay, just a bit simpler.
#!/usr/bin/awk -f
{
    if (x == 0) pattern = /AAA/
    else if (x == 1) pattern = /BBB/
    else if (x == 2) pattern = /CCC/
    else pattern = /DDD/
    if (!pattern) x = /AAA/ ? 0 : -1
    if (x >= 0) a[x++] = $0
    if (x == 5) {for (i = 0; i < x; i++) print a[i]}
}
--
Regards,
---Robert
Ed Morton wrote:
<snip>
> Or alternatively just set the approriate RS and FS then print out the 3
> lines before each RS followed by the RS, e.g.:
>
> gawk -vRS="DDD\nDDD\n" -vFS="\n" '
> {printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Just occurred to me that'll fail if the input file doesn't end in
DDD\nDDD\n, so you need a small tweak. You just need to check that RT
got set so this'll work:
gawk -vRS="DDD\nDDD\n" -vFS="\n" '
RT{printf "%s\n%s\n%s\n%s",$(NF-3),$(NF-2),$(NF-1),RS}'
Regards,
Ed.
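A sketch of the record-separator approach with the RT check, wrapped in a shell function. A regular-expression RS and the RT variable are gawk extensions, so this will not run under other awks. The function name keep_terminated and the NF >= 4 guard (which skips a DDD pair that has fewer than three lines in front of it) are assumptions added here, not part of the posted one-liner.

```shell
# Each record is whatever precedes a DDD/DDD pair; fields are its lines.
keep_terminated() {
    gawk -v RS='DDD\nDDD\n' -v FS='\n' '
        # RT is empty for a trailing record with no DDD pair after it.
        RT != "" && NF >= 4 {
            printf "%s\n%s\n%s\n%s", $(NF-3), $(NF-2), $(NF-1), RT
        }
    '
}

# usage: keep_terminated < file
```

Because each record ends with a newline, the last field is empty, which is why the three wanted lines sit at NF-3, NF-2 and NF-1. Printing RT instead of RS emits the text that actually matched, which is the same thing here since RS is a fixed string.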
Robert Katz wrote:
> Robert Katz wrote:
>
>
> Okay, just a bit simpler.
>
> #!/usr/bin/awk -f
> {
> if (x == 0) pattern = /AAA/
> else if (x == 1) pattern = /BBB/
> else if (x == 2) pattern = /CCC/
> else pattern = /DDD/
> if (!pattern) x = /AAA/ ? 0 : -1
> if (x >= 0) a[x++] = $0
> if (x == 5) {for (i = 0; i < x; i++) print a[i]}
> }
>
And simpler still. I changed the variable to make it clearer that valid
is just a boolean with values of 0 or 1.
#!/usr/bin/awk -f
{
    if (x == 0) valid = /AAA/
    else if (x == 1) valid = /BBB/
    else if (x == 2) valid = /CCC/
    else valid = /DDD/
    if (!valid) x = 0
    a[x++] = $0
    if (x == 5) for (i = 0; i < x; i++) print a[i]
}
--
Regards,
---Robert
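One caveat on the shortest version: it stores whatever line broke the sequence as the start of a new group. That is harmless when incomplete groups are only ever missing their DDD lines, as the original post states, because the line that breaks the sequence is then always an AAA. With looser input such as AAA BBB DDD DDD followed by BBB CCC DDD DDD, it would print DDD BBB CCC DDD DDD. A table-driven sketch that only ever starts a group on AAA is shown below; the names strict_groups and want are hypothetical.

```shell
strict_groups() {
    awk '
        BEGIN { n = split("AAA BBB CCC DDD DDD", want, " ") }   # want[1..5]
        {
            if ($0 !~ want[x + 1]) x = 0        # sequence broken: start over
            if ($0 ~ want[x + 1]) a[++x] = $0   # line fits position x+1: keep it
            if (x == n) { for (i = 1; i <= n; i++) print a[i]; x = 0 }
        }
    '
}

# Sample input from the original post, without the line numbers.
printf '%s\n' AAA BBB CCC DDD DDD AAA BBB CCC DDD DDD \
              AAA BBB CCC AAA BBB CCC DDD DDD | strict_groups
```

The expected sequence lives in one string, so a different group layout only needs a change to the split call.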