Match First Sequence in Regular Expression?
| Roger L. Cauvin 2006-01-26, 6:56 pm |
| Say I have some string that begins with an arbitrary sequence of characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence is
exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could
be?
--
Roger L. Cauvin
nospam_roger@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com
| |
| Janis Papanagnou 2006-01-26, 6:56 pm |
| Roger L. Cauvin wrote:
> Say I have some string that begins with an arbitrary sequence of characters
> and then alternates repeating the letters 'a' and 'b' any number of times,
> e.g.
>
> "xyz123aaabbaabbbbababbbbaaabb"
>
> I'm looking for a regular expression that matches the first, and only the
> first, sequence of the letter 'a', and only if the length of the sequence is
> exactly 3.
/[^a]aaa[^a]/
If you use function sub() you may replace just the first occurrence or use
match() to get the start and end index (adjust it by 1, because the [^a]'s
are also part of the matching pattern).
> Does such a regular expression exist? If so, any ideas as to what it could
> be?
Janis
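
A minimal sketch of the match() approach described above, wrapped in a hypothetical shell function (the name bracketed_aaa and the wrapper are assumptions for illustration, not part of the original post). Because both [^a] characters belong to the match, the run itself starts at RSTART + 1 and is RLENGTH - 2 characters long. Note that this pattern needs a non-'a' character on each side, so it would presumably miss a run at the very start or end of the string, and it does not check whether shorter or longer runs of 'a' came earlier.

```sh
# Hypothetical helper: print the start position and length of the first
# run of exactly three a characters that has a non-a on each side.
bracketed_aaa() {
  printf '%s\n' "$1" | awk '{
    if (match($0, /[^a]aaa[^a]/)) {
      # RSTART points at the leading [^a]; RLENGTH counts both [^a] characters
      print RSTART + 1, RLENGTH - 2
    } else {
      print "no match"
    }
  }'
}

bracketed_aaa "xyz123aaabbaabbbbababbbbaaabb"   # prints: 7 3
```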
| |
| Harlan Grove 2006-01-26, 6:56 pm |
| Roger L. Cauvin wrote...
...
>"xyz123aaabbaabbbbababbbbaaabb"
>
>I'm looking for a regular expression that matches the first, and only the
>first, sequence of the letter 'a', and only if the length of the sequence is
>exactly 3.
>
>Does such a regular expression exist? If so, any ideas as to what it could
>be?
Awk regular expressions always match from left to right, so the regexp
aaa([^a]|$)
would match the first sequence of 3 and only 3 a's plus (unfortunately)
the character immediately following, if any. Since awk's regular
expressions lack noncapturing assertions, there's no way to tie this
just to the first such match or avoid including the character
immediately following. If you're trying to replace the first & only the
first such instance, then in generic awk try
    # match against s plus a trailing blank so there is always a character after aaa
    if (match(s " ", /^((a|aa)?[^a]+)*aaa([^a]|$)/)) {
        s = substr(s " ", 1, RLENGTH - 4) "<whatever>" substr(s, RLENGTH)
    }
and in gawk try
    gensub(/^(((a|aa)?[^a]+)*)aaa([^a]|$)/, "\\1" "<your replacement string here>" "\\4", "", s)
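
A runnable sketch along the lines of the generic-awk snippet above, wrapped in a hypothetical shell function (replace_first_aaa and its two arguments are assumptions for illustration). It matches against the string plus one sentinel blank, so the trailing [^a] always consumes exactly one character and the substr() arithmetic is the same whether the run sits mid-string or at the very end. As with the pattern in the post, it only matches when every earlier run of 'a' is shorter than three.

```sh
# Hypothetical helper: replace the first run of exactly three a characters
# with the second argument; print the string unchanged if there is no such run.
replace_first_aaa() {
  printf '%s\n' "$1" | awk -v repl="$2" '{
    s = $0
    # The sentinel blank guarantees one matched character after aaa.
    if (match(s " ", /^((a|aa)?[^a]+)*aaa[^a]/)) {
      # RLENGTH - 4 = prefix length (drop aaa and the following character);
      # position RLENGTH is that following character in the original string.
      s = substr(s, 1, RLENGTH - 4) repl substr(s, RLENGTH)
    }
    print s
  }'
}

replace_first_aaa "xyz123aaabbaabbbbababbbbaaabb" "<X>"   # prints: xyz123<X>bbaabbbbababbbbaaabb
```

Building the result with substr() rather than sub() sidesteps the special meaning of & in replacement strings.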
| |
|
|
Roger L. Cauvin wrote:
> Say I have some string that begins with an arbitrary sequence of characters
> and then alternates repeating the letters 'a' and 'b' any number of times,
> e.g.
>
> "xyz123aaabbaabbbbababbbbaaabb"
>
> I'm looking for a regular expression that matches the first, and only the
> first, sequence of the letter 'a', and only if the length of the sequence is
> exactly 3.
>
> Does such a regular expression exist? If so, any ideas as to what it could
> be?
>
> --
To print the position of the first match, for example,
$ echo xyzaaaaa123aaabbaa | perl -nle 'print length($`)+1 if /(?<!a)a{3}[^a]/'
12
James
| |
| Janis Papanagnou 2006-01-26, 6:56 pm |
| James wrote:
> Roger L. Cauvin wrote:
>
>
>
> To print the position of the first match, for example,
>
> $ echo xyzaaaaa123aaabbaa | perl -nle 'print length($`)+1 if /(?<!a)a{3}[^a]/'
> 12
>
> James
>
How about this...
echo "a regular expression that matches the first, and only the \
first, sequence of the letter 'a', and only if the length of the \
sequence is exactly 3" | yaAIparser --naturalspeech
Though, unfortunately, both of our solutions are off-topic in c.l.awk,
especially if not even the regexps can be borrowed to be used in awk.
Janis
| |
| William James 2006-01-26, 6:56 pm |
| James wrote:
> Roger L. Cauvin wrote:
>
> To print the position of the first match, for example,
>
> $ echo xyzaaaaa123aaabbaa | perl -nle 'print length($`)+1 if /(?<!a)a{3}[^a]/'
> 12
This fails if the "aaa" is at the end of the string. And the o.p. said
that no a's can precede "aaa".
echo xyz123aaabbaabbbababbbaaab|ruby -ne'p $1.size+1 if /^([^a]*)aaa(?!a)/'
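
For comparison, the same check can be sketched in plain awk without lookaround, assuming the usual leftmost-longest matching: /a+/ then covers the whole of the first run of 'a', and RLENGTH gives its length. The function name first_run_pos and the convention of printing 0 for "no qualifying run" are assumptions for illustration.

```sh
# Hypothetical helper: print the position of the first run of a characters
# if that run is exactly three long, otherwise print 0.
first_run_pos() {
  printf '%s\n' "$1" | awk '{
    # /a+/ matches the leftmost run of a, extended as far as it goes
    if (match($0, /a+/) && RLENGTH == 3) {
      print RSTART
    } else {
      print 0
    }
  }'
}

first_run_pos "xyz123aaabbaabbbababbbaaab"   # prints: 7
first_run_pos "xyzaaa"                      # prints: 4 (run at end of string)
first_run_pos "xyzaaaaa123aaabbaa"          # prints: 0 (first run has five a characters)
```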
| |
| Roger L. Cauvin 2006-01-26, 6:56 pm |
| "Janis Papanagnou" <Janis_Papanagnou@hotmail.com> wrote in message
news:drb6um$gbf$1@online.de...
> Roger L. Cauvin wrote:
>
> /[^a]aaa[^a]/
>
> If you use function sub() you may replace just the first occurrence or use
> match() to get the start and end index (adjust it by 1, because the [^a]'s
> are also part of the matching pattern).
Someone on another newsgroup provided a solution:
.*?(?<![ab])aaab
--
Roger L. Cauvin
nospam_roger@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com
| |
| Ed Morton 2006-01-26, 6:56 pm |
|
Roger L. Cauvin wrote:
> "Janis Papanagnou" <Janis_Papanagnou@hotmail.com> wrote in message
> news:drb6um$gbf$1@online.de...
>
>
>
> Someone on another newsgroup provided a solution:
>
> .*?(?<![ab])aaab
>
How would you use that? I don't think the original question has any
meaning for actual usage since (in awk at least) REs always match from
left to right. A more interesting question would be how to match
anything other than the first occurrence, but maybe I shouldn't open
that can of worms....
Ed.
| |
| Harlan Grove 2006-01-26, 6:56 pm |
| Ed Morton wrote...
...
> . . . A more interesting question would be how to match
>anything other than the first occurrence, but maybe I shouldn't open
>that can of worms....
But now that you have, so to find, say, the 3rd match using gawk (with
--re-interval),
match(s, /^((((a|aa)?[^a]+)*)aaa([^a]|$)){3}/)
which would set RLENGTH to the length of the entire matched substring
(and, since the match is anchored at the start of s, to the position
where the 3rd match ends).
| |
| eeb4u@hotmail.com 2006-01-26, 9:55 pm |
|
Harlan Grove wrote:
> Ed Morton wrote...
> ...
>
> But now that you have, so to find, say, the 3rd match using gawk (with
> --re-interval),
>
> match(s, /^((((a|aa)?[^a]+)*)aaa([^a]|$)){3}/)
>
> which would set RLENGTH to the end of the entire matched substring.
Back to OP, I am a novice user of sed and awk, but I will take a shot
at the OP using sed
create file named sedcmd
contents:
sed -n "
/^......a\{3\}/{
p
/^......a\{3\}/q
}" $1
and issue "sedcmd <datafile>" at command prompt.
this worked with my sample data.
constructive criticism welcomed (go easy on me!)
Mike Dundas
System Administrator
Asbury Park Press
| |
| Ed Morton 2006-01-27, 3:55 am |
| eeb4u@hotmail.com wrote:
> Harlan Grove wrote:
>
>
>
> Back to OP, I am a novice user of sed and awk, but I will take a shot
> at the OP using sed
>
> create file named sedcmd
> contents:
>
> sed -n "
> /^......a\{3\}/{
> p
> /^......a\{3\}/q
> }" $1
>
> and issue "sedcmd <datafile>" at command prompt.
>
> this worked with my sample data.
>
> constructive criticism welcomed (go easy on me!)
Hopefully you won't take this the wrong way as I do intend it to be
constructive but I can't be too constructive since fixing a UNIX sed
solution would be OT for comp.lang.awk:
a) This is an awk NG so it's OT to post a sed response to an awk question.
b) This isn't a UNIX NG so it's OT to post a UNIX-specific solution
(though it would be fine to post a general solution with an example of
how it'd work on UNIX, especially if the OP indicated they were using UNIX).
c) The posted solution finds the first occurrence of 3 or more "a"s
preceded by exactly 6 characters at the start of the line, which isn't
what the OP asked for.
d) The posted solution is needlessly complicated, even for sed (i.e.
there's a briefer way to do that same thing in sed)
Regards,
Ed.
| |
| Ed Morton 2006-01-27, 3:55 am |
| Harlan Grove wrote:
> Ed Morton wrote...
> ...
>
>
>
> But now that you have, so to find, say, the 3rd match using gawk (with
> --re-interval),
>
> match(s, /^((((a|aa)?[^a]+)*)aaa([^a]|$)){3}/)
>
> which would set RLENGTH to the end of the entire matched substring.
>
Yeah, but I still don't see what you'd do with that. I mean, if you
wanted to print everything after the third occurrence of the pattern,
say, then you could just do something like this:
awk 'BEGIN{FS=OFS="aaa"}{sub($1 FS $2 FS,"")}1'
If you just wanted to find out if there WAS a third occurrence of the
pattern, you could do something like this:
awk -F"aaa" 'NF>3'
Perhaps the OP will shed some light on what they're really trying to do....
Ed.
| |
| Harlan Grove 2006-01-27, 3:55 am |
| Ed Morton wrote...
...
>wanted to print everything after the third occurrence of the pattern,
>say, then you could just do something like this:
>
> awk 'BEGIN{FS=OFS="aaa"}{sub($1 FS $2 FS,"")}1'
...
Begs the question whether
xaaaxaaaaaax
has 1 or 3 qualifying sequences of 3 & only 3 a's. If only 1, then your
1-liner doesn't meet specs. I'm interpreting the OP literally: exactly
3 a's in sequence.
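
The difference between the two readings can be checked with a short sketch (count_aaa is a hypothetical helper, not something from the thread). It prints two numbers: how many separators a field split on "aaa" finds, and how many runs of 'a' are exactly three characters long.

```sh
# Hypothetical helper: compare the FS="aaa" view with a literal
# "runs of exactly three a characters" count.
count_aaa() {
  printf '%s\n' "$1" | awk '{
    s = $0
    # Number of separators seen when splitting on aaa (fields minus one)
    fs_hits = split(s, parts, /aaa/) - 1
    exact = 0
    # Walk the string one run of a at a time; /a+/ takes each run whole
    while (match(s, /a+/)) {
      if (RLENGTH == 3) exact++
      s = substr(s, RSTART + RLENGTH)
    }
    print fs_hits, exact
  }'
}

count_aaa "xaaaxaaaaaax"   # prints: 3 1
```

On the example string the field-splitting view sees three occurrences, because the run of six is treated as two separators, while the literal reading finds only one.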
| |
| Alan Mackenzie 2006-01-27, 3:55 am |
| Harlan Grove <hrlngrv@aol.com> wrote on 26 Jan 2006 11:31:01 -0800:
> Awk regular expressions always match from left to right, so the regexp
> aaa([^a]|$)
> would match the first sequence of 3 and only 3 a's plus (unfortunately)
> the character immediately following, if any. Since awk's regular
> expressions lack noncapturing assertions, there's no way to tie this
> just to the first such match or avoid including the character
> immediately following.
"Noncapturing assertions". A lovely term, and I understand exactly what
it means! Is there any implementation of regular expressions which does
have these things? perl, perhaps? What would the regexp for exactly
matching a sequence of exactly three "a"s then look like?
--
Alan Mackenzie (Munich, Germany)
Email: aacm@muuc.dee; to decode, wherever there is a repeated letter
(like "aa"), remove half of them (leaving, say, "a").
| |
| Harlan Grove 2006-01-27, 6:56 pm |
| Alan Mackenzie wrote...
...
>"Noncapturing assertions". A lovely term, and I understand exactly what
>it means! Is there any implementation of regular expressions which does
>have these things? perl, perhaps? What would the regexp for exactly
>matching a sequence of exactly three "a"s then look like?
Yes, perl has 'em, as does .Net and any software making use of the pcre
package. As for what such regexps would look like, wouldn't that be
rather off-topic in c.l.a?
| |
| Alan Mackenzie 2006-01-28, 7:55 am |
| Harlan Grove <hrlngrv@aol.com> wrote on 27 Jan 2006 09:58:30 -0800:
> Alan Mackenzie wrote...
> ...
> Yes, perl has 'em, as does .Net and any software making use of the pcre
> package. As for what such regexps would look like, wouldn't that be
> rather off-topic in c.l.a?
I suppose so. I mean, discussing features which would be nice to hack
into a future version of gawk really isn't the purpose of this newsgroup,
is it? It could confuse people reading it.
;-)
--
Alan Mackenzie (Munich, Germany)
Email: aacm@muuc.dee; to decode, wherever there is a repeated letter
(like "aa"), remove half of them (leaving, say, "a").
| |
| Kenny McCormack 2006-01-28, 6:55 pm |
| In article <4ncfrd.68.ln@acm.acm>, Alan Mackenzie <acm@muc.de> wrote:
...
>I suppose so. I mean, discussing features which would be nice to hack
>into a future version of gawk really isn't the purpose of this newsgroup,
>is it? It could confuse people reading it.
I think that any people who would be confused by that are, by definition,
already confused.