The precise behaviour of the | operator in POSIX extended regexps
| Spiros Bousbouras 2006-12-19, 7:10 pm |
| Assume we have a regexp of the form 'E1|E2', where E1 and E2 are also
regexps, and we attempt to match it against a string where both E1
and E2 match. Does the POSIX standard (or some man page) determine
which of E1, E2 is considered the match? This could be important
if the whole of E1|E2 is inside parentheses and we refer to it
later.
I'm interested in the answer in the general case, where we have regexps
E1, E2, ..., En and we form a regexp by taking their disjunction, i.e.
E1|E2|...|En
| |
| Kaz Kylheku 2006-12-20, 4:11 am |
| Spiros Bousbouras wrote:
> Assume we have a regexp of the form 'E1|E2' where E1 , E2 are also
> regexps and we attempt to match it against a string where both E1
> and E2 match. Does the POSIX standard (or some man page) determine
> which of E1 , E2 is considered the match ? This could be important
> in case the whole of E1|E2 is inside parentheses and we refer to it
> later.
Uh, they are /both/ the match. The entire expression.
What if [0-9] is inside parentheses? How do you know which of the ten
digits was the match for that subexpression? [0-9] is equivalent to
0|1|2|3|4|5|6|7|8|9.
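
To make the comparison concrete, here is a small sketch that captures a digit once with a bracket expression and once with the spelled-out alternation. The input string and variable names are made up for illustration, and it assumes a sed that accepts -E for extended regexps (current GNU and BSD sed do). In both cases the group should hold the text that was actually matched, not a record of which alternative produced it.

```shell
#!/bin/sh
# Sketch only: the input string is made up; sed -E is assumed to be available.
input='room 7'

# Capture a digit with a bracket expression.
with_bracket=$(printf '%s\n' "$input" | sed -E 's/([0-9])/<\1>/')

# Capture a digit with the equivalent ten-way alternation.
with_alternation=$(printf '%s\n' "$input" | sed -E 's/(0|1|2|3|4|5|6|7|8|9)/<\1>/')

printf '%s\n%s\n' "$with_bracket" "$with_alternation"
```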
> I'm interested in the answer in the general case where we have regexps
> E1 , E2 , ... , En and we form a regexp by taking their disjunction ie
> E1|E2|...|En
If you want to capture the text which matched a given subexpression,
you have to parenthesize /that/ subexpression. So for instance (E1|E2)
isn't good enough if you are interested in distinguishing which
alternative matched. But of course, you can do (E1)|(E2). If you write
((E1)|(E2)), then the first register contains the match for the entire
thing (outer parentheses), the second register contains the match for
E1, and the third for E2. I think that what happens is that either the
second or the third will simply be empty.
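
A quick way to check this is to print all three registers in a substitution. The following is only a sketch: "fred" and "joe" stand in for E1 and E2, show_groups is a helper name invented here, and it assumes a sed that accepts -E. With the seds I know of, a group that did not take part in the match expands to nothing in the replacement.

```shell
#!/bin/sh
# Sketch only: "fred" and "joe" are stand-ins for E1 and E2.
# \1 is the outer group, \2 is (E1), \3 is (E2).
show_groups() {
    printf '%s\n' "$1" | sed -E 's/((fred)|(joe))/[\1][\2][\3]/'
}

first=$(show_groups 'fred')    # E1 takes part in the match, E2 does not
second=$(show_groups 'joe')    # E2 takes part in the match, E1 does not

printf '%s\n%s\n' "$first" "$second"
```

Whichever of the second and third brackets is non-empty tells you which alternative matched.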
| |
| Spiros Bousbouras 2006-12-21, 7:05 pm |
| Icarus Sparry wrote:
> On Tue, 19 Dec 2006 16:07:56 -0800, Spiros Bousbouras wrote:
>
>
> The usual matching is first "leftmost", then "longest" of successful
> matches.
>
> Using GNU sed, which has patterns of this form, one sees
>
> I="I am the friend of fred and joe today"
> echo "$I" | sed -r 's/joe|fred/bill/'
> outputting
> I am the friend of bill and joe today
>
> So here fred matched, because it begins leftmost (earliest) in the input.
>
> echo "$I" | sed -r 's/of f[der]*|of fr[^t]*/peter /'
> outputs
> I am the friend peter today
>
> Here both patterns match at the same place, so the longer one, matching
> "of fred and joe " rather than "of fred" wins.
I did some experiments myself, and the behaviour I
observed agrees with what you report above. What
I don't know is whether this behaviour is specified
by POSIX or left up to the implementation.
> This may be spelled out in your online manual for "regexp". If not the
> O'Reilly book "Mastering Regular Expressions" is well worth reading.
I have read a good portion of the book. I don't remember
it saying anything about POSIX but I don't have easy access
to it at the moment.
| |
| Spiros Bousbouras 2006-12-21, 7:05 pm |
| Spiros Bousbouras wrote:
> Icarus Sparry wrote:
>
> I did some experiments myself and the behaviour I
> observed agrees with what you report above but what
> I don't know is whether this behaviour is up to the
> implementation or specified by POSIX.
Or, to put it another way: how portable would a script
that relies on that behaviour be?
| |
| Spiros Bousbouras 2006-12-23, 7:02 pm |
| Icarus Sparry wrote:
> On Thu, 21 Dec 2006 16:36:37 -0800, Spiros Bousbouras wrote:
>
>
> Very portable. See
> http://www.opengroup.org/onlinepubs...799/xbd/re.html
>
> "The search for a matching sequence starts at the beginning of a string
> and stops when the first sequence matching the expression is found,
> where first is defined to mean "begins earliest in the string". If the
> pattern permits a variable number of matching characters and thus there
> is more than one such sequence starting at that point, the longest such
> sequence will be matched."
Thanks, that nails it.
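
For anyone who wants to check a particular sed against the quoted rule, here is a minimal sketch. The alternatives "ab" and "abcd" are made up so that both match at the same starting position, and it assumes a sed that accepts -E. If the implementation follows the leftmost-longest rule, the order in which the alternatives are written should not matter, and both substitutions should consume the whole of "abcd" and leave just X.

```shell
#!/bin/sh
# Sketch only: "ab" and "abcd" are made-up alternatives that both match at
# the start of the input; sed -E is assumed to be available.
input='abcd'

# Shorter alternative written first.
short_first=$(printf '%s\n' "$input" | sed -E 's/ab|abcd/X/')

# Longer alternative written first.
long_first=$(printf '%s\n' "$input" | sed -E 's/abcd|ab/X/')

printf 'short first: %s\nlong first:  %s\n' "$short_first" "$long_first"
```

A result of "Xcd" for the first substitution would indicate a first-alternative-wins matcher of the kind used by Perl-style engines, rather than the behaviour the standard describes.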