Home > Archive > PERL Beginners > July 2005 > regular expression match question
regular expression match question
|
|
| Pine Yan 2005-07-29, 5:01 pm |
|
A script like this:
line1: $string3 = "bacdeabcdefghijklabcdeabcdefghijkl";
line2: $string4 = "xxyyzzbatttvv";
line3: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3
=~ /(a|b)*/);
line4: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string4
=~ //);
Running it gives this result:
$1 = a @{0,2}, $& = ba
$1 = @{0,0}, $& =
If I change the code of line3 to:
print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3 =~
/(a|b)+/);
and keep everything else the same, I will get:
$1 = a @{0,2}, $& = ba
$1 = a @{6,8}, $& = ba
This result doesn't look right to me. The first match gives the same
result with * and with +, while the second, which is supposed to inherit
the previous pattern, gives a different result.
The version I'm using is 5.6.1. Can anyone tell me why this happens? Is it
a bug, or some trick I can't figure out? Does this work differently in the
latest release?
Thanks.
Sincerely
Pine
| |
| Jeff 'japhy' Pinyan 2005-07-29, 5:01 pm |
| On Jul 29, Pine Yan said:
> line1: $string3 = "bacdeabcdefghijklabcdeabcdefghijkl";
> line2: $string4 = "xxyyzzbatttvv";
>
> line3: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3
> =~ /(a|b)*/);
> line4: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string4
> =~ //);
>
> $1 = a @{0,2}, $& = ba
> $1 = @{0,0}, $& =
The regex says "match zero or more of (a or b)". In string 1, it matches
a 'b' and then an 'a' at the beginning, thus $& = 'ba'. In string 2, it
matches zero characters (because it's allowed to!) at the beginning, thus
$& eq ''.
> print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3 =~
> /(a|b)+/);
>
> $1 = a @{0,2}, $& = ba
> $1 = a @{6,8}, $& = ba
Now your regex says "match one or more of (a or b)". Thus in string 2,
you're matching the "ba" in the middle.
HOWEVER, you're doing something weird. You have a QUANTIFIER (the * or
the +) on a CAPTURING GROUP. Here's an example of the weirdness:
"japhy" =~ /(.)+/;
print $1;
What do you think that prints? It prints 'y'. Why? Because when you put
a quantifier on a capturing group, ONLY THE LAST REPETITION of that
capturing group gets saved. This is why you're getting only ONE letter in
$1.
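A minimal sketch of the last-repetition rule and one common way around it. The variable names are made up for illustration; the idea is to move the quantifier inside the capturing group and let a non-capturing group (?:...) do the repeating.

```perl
use strict;
use warnings;

my $word = "japhy";

# Quantifier outside the capture: $1 is overwritten on every
# repetition, so only the last character survives.
my ($last_only) = $word =~ /(.)+/;

# Quantifier inside the capture: the group holds the whole run.
my ($whole_run) = $word =~ /((?:.)+)/;

# The same idea applied to the (a|b) pattern from this thread.
my ($ab_run) = "bacde" =~ /((?:a|b)+)/;

print "last_only = $last_only\n";   # y
print "whole_run = $whole_run\n";   # japhy
print "ab_run    = $ab_run\n";      # ba
```

For single characters, a character class such as /([ab]+)/ should capture the same run as /((?:a|b)+)/.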
--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://japhy.perlmonk.org/ % have long ago been overpaid?
http://www.perlmonks.org/ % -- Meister Eckhart
| |
| Pine Yan 2005-07-29, 5:01 pm |
|
>
>The regex says "match zero or more of (a or b)". In string 1, it matches
>a 'b' and then an 'a' at the beginning, thus $& = 'ba'. In string 2, it
>matches zero characters (because it's allowed to!) at the beginning, thus
>$& eq ''.
Yeah, I see why $1 has the value "a" in this case, but another question
still remains:
If the regexp says "match zero or more of (a or b)", why can't we match
an empty string in the first place? What makes "(a|b)*" behave no
differently from "(a|b)+"?
Thanks!
Sincerely
Pine
| |
| Jeff 'japhy' Pinyan 2005-07-29, 5:01 pm |
| On Jul 29, Pine Yan said:
>
> If the regexp says "match zero or more of (a or b)", why can't we
> match an empty string in the first place? What causes "(a|b)*" to make no
> difference from "(a|b)+"?
The regex /[ab]*/ on the string "bad" matches 'ba' because regexes are
greedy by default. They want to match as MUCH as they can.
BUT regexes also try to find the earliest match in the string. This is
why /[ab]*/ on the string "cab" matches ''. Because the engine found a
successful match of 0 a's or b's at the beginning of the string.
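The two rules can be checked directly. This is only a sketch: first_match is a hypothetical helper written for this example, and it reports the offsets from @- and @+ for the first match of a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end, matched text] for the
# first match of $re in $str, or nothing if there is no match.
sub first_match {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return [ $-[0], $+[0], substr($str, $-[0], $+[0] - $-[0]) ];
}

# Greedy: at offset 0 the engine takes as much as it can ("ba").
my $bad = first_match("bad", qr/[ab]*/);

# Leftmost: zero repetitions already succeed at offset 0, so the
# engine never looks at the "ab" further right.
my $cab = first_match("cab", qr/[ab]*/);

# With + an empty match is not allowed, so the engine moves on
# until it finds "ab" at offset 1.
my $cab_plus = first_match("cab", qr/[ab]+/);

printf "%d-%d '%s'\n", @$_ for $bad, $cab, $cab_plus;
```

This prints 0-2 'ba', then 0-0 '', then 1-3 'ab'.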
| |
| Tom Allison 2005-07-29, 10:00 pm |
| Pine Yan wrote:
> A script like this:
>
> line1: $string3 = "bacdeabcdefghijklabcdeabcdefghijkl";
> line2: $string4 = "xxyyzzbatttvv";
>
> line3: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3
> =~ /(a|b)*/);
> line4: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string4
> =~ //);
>
> Running it gives this result:
>
> $1 = a @{0,2}, $& = ba
> $1 = @{0,0}, $& =
>
>
Line 3: $1 holds only the last single letter of the first match (the 'a'
at the end of the leading "ba").
@- and @+ hold the start and end offsets of the match in $string3. The
match covers characters 1 and 2; as offsets that is 0 (the start) and 2
(one past the end). perldoc perlvar explains what $& is all about.
If you wanted to match either 'ab' or 'ba' you would do it this way:
(ab|ba) to require two letters.
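A short sketch of how the offsets relate to the matched text, using the string from the original post. The variable names are illustrative only; $-[0] and $+[0] describe the whole match, and $-[1] and $+[1] describe capture group 1.

```perl
use strict;
use warnings;

my $string3 = "bacdeabcdefghijklabcdeabcdefghijkl";

$string3 =~ /(a|b)*/ or die "no match";

# $-[0] is the start offset of the whole match; $+[0] is the offset
# just past its end.
my ($start, $end) = ($-[0], $+[0]);
my $whole = substr($string3, $start, $end - $start);

# $-[1] and $+[1] are the same pair for group 1, which keeps only
# its last repetition.
my ($g_start, $g_end) = ($-[1], $+[1]);
my $group = substr($string3, $g_start, $g_end - $g_start);

printf "whole match: %d to %d = %s\n", $start, $end, $whole;
printf "group 1:     %d to %d = %s\n", $g_start, $g_end, $group;
```

The whole match runs from 0 to 2 ("ba"), while group 1 runs from 1 to 2 ("a").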
> If I change the code of line3 to:
>
> print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3 =~
> /(a|b)+/);
>
> and keep everything else the same, I will get:
>
> $1 = a @{0,2}, $& = ba
> $1 = a @{6,8}, $& = ba
I think you are right that this doesn't make sense at first.
It appears that the second expression is matching with the previous
regex rather than with an empty regex.
So I don't think $string =~ // matches the empty string at the start;
rather, it reuses the last regex that was applied.
RUN THIS
$string3="This is my favorite day.";
$string4="Some days are better than others.";
print $-[0],"\n" if $string4 =~ //;
print $-[0],"\n" if $string4 =~ /day/;
print $-[0],"\n" if $string3 =~ //;
print $-[0],"\n" if $string3 =~ /day/;
and you get
0
5
20
20
The first line prints 0 because no regex has matched yet, so // really is
an empty pattern. But the first '20' acts as if you were matching with
the last expression run (/day/).
Question: Bug or Feature?
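On the bug-or-feature question: perlop documents that an empty pattern reuses the last successfully matched regex, so this is intended behavior. The sketch below repeats the experiment and adds /(?:)/ as one way to ask for a genuinely empty pattern. The @offsets array is only there to collect the results.

```perl
use strict;
use warnings;

my $string3 = "This is my favorite day.";
my $string4 = "Some days are better than others.";

my @offsets;

# No pattern has matched yet, so // acts as a truly empty pattern.
push @offsets, $-[0] if $string4 =~ //;

push @offsets, $-[0] if $string4 =~ /day/;

# /day/ was the last successful pattern, so // now means /day/.
push @offsets, $-[0] if $string3 =~ //;

# (?:) is an empty group, not an empty pattern, so nothing is
# reused and the match succeeds at offset 0.
push @offsets, $-[0] if $string3 =~ /(?:)/;

print "@offsets\n";   # 0 5 20 0
```

This assumes all four matches run in the same scope, as they do here.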
| |
| Pine Yan 2005-07-29, 10:00 pm |
| >
>The regex /[ab]*/ on the string "bad" matches 'ba' because regexes are
>greedy by default. They want to match as MUCH as they can.
>
>BUT regexes also try to find the earliest match in the string. This is
>why /[ab]*/ on the string "cab" matches ''. Because the engine found a
>successful match of 0 a's or b's at the beginning of the string.
>
I think I've understood what you mean here.
So, next question. :)
Why do these two commands give different results?
line3: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3
=~ /(a|b)*/);
line4: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string4
=~ //);
Result:
$1 = a @{0,2}, $& = ba
$1 = @{0,0}, $& =
Thanks!
Sincerely
Pine
| |
| Pine Yan 2005-07-29, 10:00 pm |
|
According to perldoc, the "//" regexp does mean to reuse the previous
pattern.
So for me, line3 and line4 should give the same result. Thus the second
case is correct, while the first one doesn't make sense. And your example
also shows this reuse at work.
Sincerely
Pine
| |
| Jeff 'japhy' Pinyan 2005-07-29, 10:00 pm |
| On Jul 29, Pine Yan said:
> line3: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string3
> =~ /(a|b)*/);
> line4: print "\$1 = $1 \@{$-[0],$+[0]}, \$& = $&\n" if($string4
> =~ //);
>
> $1 = a @{0,2}, $& = ba
> $1 = @{0,0}, $& =
I'll go over it again. $string3 starts with "bac...", and $string4
starts with "xxy...".
When the regex /(a|b)*/ is applied to "bac...", this is what it does:
# 'a' or 'b', zero or more times
b a c ......
^ position 0: a? no
^ position 0: b? yes
b a c ......
  ^ position 1: a? yes
b a c ......
    ^ position 2: a? no
    ^ position 2: b? no
# successful match from position 0 to position 2 ("ba")
When the regex /(a|b)*/ is applied to "xxy...", this is what it does:
# 'a' or 'b', zero or more times
x x y ......
^ position 0: a? no
^ position 0: b? no
# successful match from position 0 to position 0 ("")
This is because the * quantifier means that ZERO matches of that token is
a perfectly acceptable outcome. The regex got a match at the left-most
position it tried, so it uses that match.
LEFT-MOST, and from there, LONGEST.
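The walk-through above can be imitated with a small hand-written loop. This is a simplified model for illustration, not how the real regex engine is implemented, and leftmost_run is a hypothetical helper: it tries each start position from the left, consumes as many 'a' or 'b' characters as it can, and accepts as soon as the minimum count is met (0 for *, 1 for +).

```perl
use strict;
use warnings;

# Simplified model of /(a|b)*/ (min 0) and /(a|b)+/ (min 1).
# Returns (start, end) offsets of the first acceptable run.
sub leftmost_run {
    my ($str, $min) = @_;
    my $len = length $str;
    for my $start (0 .. $len) {
        my $end = $start;
        # Greedy step: keep going while the next character is a or b.
        $end++ while $end < $len && index("ab", substr($str, $end, 1)) >= 0;
        # Leftmost step: take the first start position that qualifies.
        return ($start, $end) if $end - $start >= $min;
    }
    return;
}

my @star3 = leftmost_run("bacdeabcdefghijklabcdeabcdefghijkl", 0);
my @star4 = leftmost_run("xxyyzzbatttvv", 0);
my @plus4 = leftmost_run("xxyyzzbatttvv", 1);

print "@star3 | @star4 | @plus4\n";   # 0 2 | 0 0 | 6 8
```

The three results line up with the @{0,2}, @{0,0} and @{6,8} offsets reported in the original post.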
| |
| Pine Yan 2005-07-29, 10:00 pm |
|
No more questions. :D
Sincerely
Pine