PERL Beginners, June 2007

zero width lookahead match

Sharan Basappa

2007-05-30, 7:58 am

Hi All,

I have some background working with scanners built from Flex, and I have
used the lookahead capability of Flex many times. But I don't understand the
meaning of ZERO in the zero lookahead match rule, i.e. (?=pattern).

For example, to capture overlapping 3-digit patterns from the string
$str = 123456, I use the regex @store = $str =~ m/(?=(\d\d\d))/g;
So here the regex engine actually looks ahead by 3 digits.

The other question I have is: how does the regex engine decide that it has to
move its scanner ahead by 1 character every time? I get the output
123 234 345 456 when I run this script.

Regards,
Sharan

Chas Owens

2007-05-30, 6:59 pm

On 5/30/07, Sharan Basappa <sharan.basappa@gmail.com> wrote:
> Hi All,
>
> I have some background working with scanners built from Flex. And I have
> used lookahead capability of flex many a times. But I dont understand the
> meaning of ZERO in zero lookahead match rule i.e. (?=pattern)

snip

I don't know jack about flex, so I can't help you with a comparison, but

snip
> The other question I have is - how does regex engine decide that it has to
> move further its scanner by 1 character everytime

snip

This is what the zero-width lookahead assertion means. It says: without
moving where you are currently starting the match, make certain
you can match the following pattern. If you want it to move where the
match starts, then you have to include something that does not have
zero width, like this:

# match a group of three digits that is followed by three more digits: only "123" here
@store = $str =~ m/(\d\d\d)(?=\d\d\d)/g;
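A minimal sketch of the difference described above, assuming the $str = '123456' from the original question. The variable names @consumed and @overlapping are only illustrative.

```perl
use strict;
use warnings;

my $str = '123456';

# Consuming pattern: the match eats three digits, so the next attempt
# starts after them, where no group is followed by three more digits.
my @consumed = $str =~ /(\d\d\d)(?=\d\d\d)/g;

# Pure lookahead: nothing is consumed, so every starting position that
# has three digits ahead of it yields a capture.
my @overlapping = $str =~ /(?=(\d\d\d))/g;

print "consuming:   @consumed\n";      # 123
print "overlapping: @overlapping\n";   # 123 234 345 456
```

The only difference between the two patterns is whether the three digits sit inside or outside the lookahead, which decides whether they are consumed.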
Chas Owens

2007-05-30, 6:59 pm

On 5/30/07, Sharan Basappa <sharan.basappa@gmail.com> wrote:
> Hi All,
>
> I have some background working with scanners built from Flex. And I have
> used lookahead capability of flex many a times. But I dont understand the
> meaning of ZERO in zero lookahead match rule i.e. (?=pattern)

snip

You may also prefer to use the Parse::RecDescent module.

Sharan Basappa

2007-05-30, 6:59 pm

> this is what the zero-width lookahead assertion means.
> snip

You mention that if I write a rule like @store = $str =~ m/(?=(\d\d\d))/g;
then the scanner does not move ahead. But as I mentioned in my mail,
the result of this regex is 123 234 etc. This clearly shows that after every
match, the regex engine of perl is moving its pointer to the next char in the
string (i.e. it starts looking at 23456 once 123 is matched).
This was exactly my question.

Regarding the other question about comparing with Flex, actually there is
no need to compare with flex. What I was trying to understand is why it is
called zero lookahead rule when the number of chars it looks ahead depends
on the rule I write. For example, the regex in the above rule looks 3 chars
ahead to find a match.

Regards,
Sharan






Chas Owens

2007-05-30, 6:59 pm

On 5/30/07, Sharan Basappa <sharan.basappa@gmail.com> wrote:
>
> "456"
>
> You mention that if I write a rule like @store = $str =~ m/(?=(\d\d\d))/g;
> then the scanner does not move ahead. But as I mentioned in my mail,
> the result of this regex is 123 234 etc. This clearly shows that after every
> match, the regex engine of perl is moving its pointer to the next char in the
> string (i.e. it starts looking at 23456 once 123 is matched).
> This was exactly my question.

snip

Because after each match it always moves ahead, either past the characters
the match consumed or, when the match was empty, by one character.
Zero-width constructs do not consume any characters. That is why
they are called zero-width.
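To make the one-character step visible, here is a small sketch that records pos() after each match of the lookahead from the original question. It assumes the same input string; @seen is just an illustrative name.

```perl
use strict;
use warnings;

my $str = '123456';
my @seen;

while ($str =~ /(?=(\d\d\d))/g) {
    # The overall match is empty, so pos() is also where the match began.
    push @seen, pos($str) . ":$1";
}

print "@seen\n";   # 0:123 1:234 2:345 3:456
```

Each match has length zero, yet the next attempt begins one character further on, because the engine will not accept a second empty match at the same position.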

snip
> Regarding the other question about comparing with Flex, actually there is
> no need to compare with flex. What I was trying to understand is why it is
> called zero lookahead rule when the number of chars it looks ahead depends
> on the rule I write. For example, the regex in the above rule looks 3 chars
> ahead to find a match.

snip

Because it is not called zero lookahead; it is called a zero-width
positive lookahead assertion. That is, it consumes zero characters from
the string while at the same time causing the match to fail if the
assertion does not match.

Rob Dixon

2007-05-30, 6:59 pm

Sharan Basappa wrote:
>
> Hi All,
>
> I have some background working with scanners built from Flex. And I have
> used lookahead capability of flex many a times. But I dont understand the
> meaning of ZERO in zero lookahead match rule i.e. (?=pattern)
>
> For example, to capture overlapping 3 digit patterns from string $str =
> 123456
> I use the regex @store = $str =~ m/(?=(\d\d\d))/g;
> So here the regex engine actually looks ahead by 3 digits.


As far as lookahead expressions are concerned, Perl functions identically to
Flex. It is called zero-width lookahead because it matches a zero-width
/position/ in the string instead of a sequence of characters. If I write

'123456' =~ /\d\d\d(...)/

then '456' will be captured as the first three characters were consumed by the
preceding pattern. However if I write

'123456' =~ /(?=\d\d\d)(...)/

then '123' will be captured instead because the lookahead pattern has zero width.
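The two matches above can be run side by side. This is only a sketch; $after and $at are illustrative names.

```perl
use strict;
use warnings;

# The leading \d\d\d consumes "123", so (...) captures what follows it.
my ($after) = '123456' =~ /\d\d\d(...)/;

# The lookahead checks for three digits but consumes nothing,
# so (...) captures starting from the same position.
my ($at) = '123456' =~ /(?=\d\d\d)(...)/;

print "after consuming: $after\n";   # 456
print "after lookahead: $at\n";      # 123
```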

> The other question I have is - how does regex engine decide that it has to
> move further its scanner by 1 character everytime since I get output 123
> 234
> 345 456
> when I run this script ?


The engine moves as far through your target string as it needs to to find a new
match. If I write

'1B3D5F' =~ /(?=(.\d.))/g;

then the engine will find a match at only every second character, and if I use
a much simpler zero-width match, just

'ABCDEF' =~ //g

then the regex will match seven times - at the beginning and end and between
every pair of characters - so the more complex zero-width match you have written
will match at all of those places as long as there are three digits following.
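A short sketch of both examples. One assumption to note: the empty match is written as /(?:)/ rather than //, because an empty pattern can be replaced by the last successfully matched pattern, and /(?:)/ avoids that.

```perl
use strict;
use warnings;

# Only positions where the middle character is a digit can match.
my @hits = '1B3D5F' =~ /(?=(.\d.))/g;

# An explicit empty group matches at every position: before each of the
# six characters and once more at the end of the string.
my $count = () = 'ABCDEF' =~ /(?:)/g;

print "hits:  @hits\n";    # B3D D5F
print "count: $count\n";   # 7
```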

HTH,

Rob

Sharan Basappa

2007-05-30, 6:59 pm

Thanks Rob and Chas ..



Paul Lalli

2007-05-30, 6:59 pm

On May 30, 10:02 am, chas.ow...@gmail.com (Chas Owens) wrote:
> On 5/30/07, Sharan Basappa <sharan.basa...@gmail.com> wrote:


>
> Because it always moves ahead by either one character or the match,
> but zero-width constructs do not consume any characters. That is why
> they are called zero-width.


I was puzzled by this too. I think Sharan's question comes down to
"why isn't this an infinite loop?" That is, why does pos() move ahead
one character when it matches 0 characters? This is not limited to
look-ahead assertions. The behavior can be seen in other constructs
as well. For example:

$ perl -wle'
$string = "abc";
while ($string =~ /(.*?)/g) {
    print pos($string), ": ", $1;
}
'
0:
1: a
1:
2: b
2:
3: c
3:

It appears that Perl is actually dividing the string up into
"characters" and "slots between characters", and allowing pos() to move
to each of them in sequence. So at the beginning, it's at the slot
before the first character, and it can successfully match 0
characters. Then pos() moves to the first character, and the fewest
characters it can find is that one character, so $1 gets 'a'. Then it
moves to the slot between 'a' and 'b'. Etc.

Here's another that doesn't allow any characters to be matched:

$ perl -wle'
$string = "abc";
while ($string =~ /(.{0})/g) {
    print pos($string), ": ", $1;
}
'
0:
1:
2:
3:

Would the above be an accurate description of what's happening? And
if so, is this behavior documented anywhere? I couldn't find it in a
cursory examination of either perlop or perlre...

Thanks,
Paul Lalli

Chas Owens

2007-05-30, 6:59 pm

On 30 May 2007 08:53:54 -0700, Paul Lalli <mritty@gmail.com> wrote:
snip
> I got by this too. I think Sharan's question comes down to
> "why isn't this an infinite loop?" That is, why does pos() move ahead
> one character when it matches 0 characters? This is not limited to
> look-ahead assertions. The behavior can be seen in other constructs
> as well. For example:
>
> $ perl -wle'
> $string = "abc";
> while ($string =~ /(.*?)/g) {
> print pos($string), ": ", $1;
> }
> '
> 0:
> 1: a
> 1:
> 2: b
> 2:
> 3: c
> 3:


Because /.*?/ matches nothing as well as a, b, and c. So it matches
nothing, then a, then nothing, then b, then nothing, then c, then
nothing.

>
> It appears that Perl is actually dividing the string up into
> "characters" and "slots between character", and allowing pos() to move
> to each of them in sequence. So at the beginning, it's at the slot
> before the first character, and it can successfully match 0
> characters. Then pos() moves to the first character, and the fewest
> characters it can find is that one character, so $1 gets 'a'. Then it
> moves to the slot between 'a' and 'b'. Etc.


Yes, otherwise \b wouldn't work very well.

perldoc perlre
A word boundary ("\b") is a spot between two characters that has a "\w"
on one side of it and a "\W" on the other side of it (in either order),
counting the imaginary characters off the beginning and end of the string
as matching a "\W".
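As a sketch of \b matching positions rather than characters, the loop below collects pos() for every word boundary in a sample string. The string 'foo bar' and the name @positions are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = 'foo bar';
my @positions;

# \b consumes nothing, so each match is an empty string at a boundary.
while ($text =~ /\b/g) {
    push @positions, pos($text);
}

print "@positions\n";   # 0 3 4 7
```

The boundaries fall at the start of the string, on either side of the space, and at the end of the string.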

snip
> Here's another, that doesn't allow any characters to be matched:
> $ perl -wle'
> $string = "abc";
> while ($string =~ /(.{0})/g) {
> print pos($string), ": ", $1;
> }
> '
> 0:
> 1:
> 2:
> 3:
>
> Would the above be an accurate description of what's happening? And
> if so, is this behavior documented anywhere? I couldn't find it in a
> cursory examanation of either perlop or perlre...

snip

You are matching the nothing between the characters.

Jeevs

2007-05-31, 7:58 am


> $ perl -wle'
> $string = "abc";
> while ($string =~ /(.*?)/g) {
> print pos($string), ": ", $1;}
>
> '
> 0:
> 1: a
> 1:
> 2: b
> 2:
> 3: c
> 3:
>

Can someone explain how the g modifier works here? As far as I knew,
g was used for substituting globally...
I get what Paul is trying to explain, but if I take out g from the
statement, why does it say "use of uninitialized value" and print
nothing?


Sharan Basappa

2007-05-31, 7:58 am

It's the same logic: continue after the first substitution/match.
In the case of a substitution it continues substituting, and in the case of a
match the search continues after the first match until the complete string is
exhausted.
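A hedged sketch of why the warning appears without g: in scalar context only a /g match records its end position in pos(), so without g, pos() stays undefined. The variable names are illustrative. A while loop on the non-/g match would also never end, since it matches at the start every time.

```perl
use strict;
use warnings;

my $string = 'abc';

# Without /g the match always starts at the beginning and does not set pos().
my $matched_plain = $string =~ /(.*?)/;
my $pos_without_g = pos($string);          # undef

# With /g a scalar-context match records where it ended, and the next
# /g match on the same string resumes from there.
my $matched_global = $string =~ /b/g;
my $pos_with_g = pos($string);             # 2

print "without g: ", (defined $pos_without_g ? $pos_without_g : 'undef'), "\n";
print "with g:    $pos_with_g\n";
```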


John W. Krahn

2007-06-01, 6:59 pm

Rob Dixon wrote:
>
> As far as lookahead expressions are concerned, Perl functions identically
> to Flex. It is called zero-width lookahead because it matches a zero-width
> /position/ in the string instead of a sequence of characters. If I write
>
> '123456' =~ /\d\d\d(...)/
>
> then '456' will be captured as the first three characters were consumed by
> the preceding pattern. However if I write
>
> '123456' =~ /(?=\d\d\d)(...)/
>
> then '123' will be captured instead because the lookahead pattern has zero
> width.
>
>
> The engine moves as far through your target string as it needs to to find
> a new match. If I write
>
> '1B3D5F' =~ /(?=(.\d.))/g;
>
> then the engine will find a match at only every second character, and if I
> use a much simpler zero-width match, just
>
> 'ABCDEF' =~ //g
>
> then the regex will match seven times - at the beginning and end and
> between every pair of characters


That will only work if no pattern has previously matched successfully in your
program; otherwise:

perldoc perlop

[ snip ]

If the PATTERN evaluates to the empty string, the last successfully
matched regular expression is used instead. In this case, only
the "g" and "c" flags on the empty pattern is honoured - the other
flags are taken from the original pattern. If no match has
previously succeeded, this will (silently) act instead as a genuine
empty pattern (which will always match).
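A small sketch of the behaviour quoted from perlop, using made-up strings. After /b/ has matched successfully, the empty pattern reuses it, while /(?:)/ remains a genuine empty match.

```perl
use strict;
use warnings;

# A successful match; its pattern becomes the "last successful" one.
my $ok = 'abc' =~ /b/;

# The empty pattern now behaves like /b/g and finds the two b's.
my @reused = 'xbxb' =~ //g;

# An explicit empty group is not an empty pattern, so it matches at all
# five positions of the four-character string.
my @explicit = 'xbxb' =~ /(?:)/g;

print "reused:   ", scalar(@reused), " matches (@reused)\n";   # 2 matches (b b)
print "explicit: ", scalar(@explicit), " matches\n";           # 5 matches
```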



John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall