YA Regex problem: lookahead assertion
| Jan Eden 2005-03-23, 3:56 pm |
| Hi,
I use the following regex to split a (really simple) file into sections headed by <h1>.+?</h1>:
while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>)#gs) {
...
}
This works perfectly, but obviously does not catch the last section, as it is not followed by <h1>.
How can I catch the last section without
* doing a separate match for it
* losing the convenience of the g switch to wade through the whole file?
Thanks,
Jan
--
I'd never join any club that would have the likes of me as a member. - Groucho Marx
| |
| Offer Kaye 2005-03-23, 3:56 pm |
| On Wed, 23 Mar 2005 17:06:59 +0100, Jan Eden wrote:
> Hi,
>
> I use the following regex to split a (really simple) file into sections headed by <h1>.+?</h1>:
>
> while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>)#gs) {
> ...
> }
>
> This works perfectly, but obviously does not catch the last section, as it is not followed
> by <h1>.
>
> How can I catch the last section without
>
> * doing a separate match for it
> * losing the convenience of the g switch to wade through the whole file?
>
> Thanks,
>
> Jan
Change your RE to:
m#<h1>(.+?)</h1>(.+?)(?=<h1>|$)#gs
In other words, look ahead to either a <h1> or the end of the string ("$").
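A minimal sketch of that pattern in action, assuming a made-up two-section input (the original file is not shown in the thread). Note that without /m, "$" also matches just before a trailing newline, so the last body comes back without its final newline:

```perl
use strict;
use warnings;

# Hypothetical sample input; not taken from the thread.
my $content = "<h1>One</h1>\nfirst body\n<h1>Two</h1>\nlast body\n";

my @sections;
while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>|$)#gs) {
    push @sections, [ $1, $2 ];    # [ heading, body ]
}

# Show each heading and its body, making the newlines visible.
for my $section (@sections) {
    my ($heading, $body) = @$section;
    $body =~ s/\n/\\n/g;
    print "$heading => '$body'\n";
}
```

Both sections are found in a single pass with the g switch, including the last one.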
I have to admit this problem wasn't as simple as I initially thought -
I still have no idea why my first guess didn't work:
m#<h1>(.+?)</h1>(.+?)(?=<h1>)?#gs
Maybe someone with more knowledge of REs can answer?
Regards,
--
Offer Kaye
| |
| Charles K. Clarkson 2005-03-23, 3:56 pm |
| Jan Eden <mailto:lists@janeden.org> wrote:
: Hi,
:
: I use the following regex to split a (really simple) file into
: sections headed by <h1>.+?</h1>:
:
: while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>)#gs) {
: ...
: }
The answer may be in your description. Use 'split'. When the
pattern passed to 'split' contains a capture group, the captured
text is returned along with the fields. The first element returned
by split is whatever precedes the first <h1> (an empty string if
nothing does), so @content is 'shift'ed to discard it.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper 'Dumper';
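# Slurp the whole DATA section into one string.
my $content = do{ local $/ = undef; <DATA>; };
# The capture group makes split return each heading along with the text after it.
my @content = split m|<h1>(.+?)</h1>|, $content;
# Discard whatever precedes the first <h1> (an empty string here).
shift @content;
print Dumper \@content;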
__END__
<h1>heading 1</h1>
Some stuff
<h1>heading 2</h1>
Some stuff
<h1>heading 3</h1>
Some stuff
<h1>heading 4</h1>
Some stuff
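
The flat list that split returns alternates heading, body, heading, body. If pairs are more convenient, something like the following might do. This is only a sketch: the sample string, the @sections array and the heading/body hash keys are made up for illustration, and it assumes every heading is followed by some text.

```perl
use strict;
use warnings;

# Hypothetical sample input with some text before the first heading.
my $content = "intro\n<h1>heading 1</h1>\nSome stuff\n<h1>heading 2</h1>\nMore stuff\n";

# With a capture group, split returns: leading text, heading, body, heading, body, ...
my ($preamble, @parts) = split m|<h1>(.+?)</h1>|, $content;

# Take the remaining elements two at a time.
my @sections;
while (my ($heading, $body) = splice @parts, 0, 2) {
    push @sections, { heading => $heading, body => $body };
}

print scalar(@sections), " sections, first heading: $sections[0]{heading}\n";
```
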
HTH,
Charles K. Clarkson
--
Mobile Homes Specialist
254 968-8328
| |
| John W. Krahn 2005-03-23, 8:55 pm |
| Jan Eden wrote:
> Hi,
Hello,
> I use the following regex to split a (really simple) file into sections headed by <h1>.+?</h1>:
>
> while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>|\z)#gs) {
> ...
> }
>
> This works perfectly, but obviously does not catch the last section, as it is not followed by <h1>.
>
> How can I catch the last section without
>
> * doing a separate match for it
> * losing the convenience of the g switch to wade through the whole file?
This should work (untested)
while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>|\z)#gs) {
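
For what it is worth, "$" and "\z" differ slightly when the string ends in a newline: "\z" matches only at the very end, while "$" (without /m) also matches just before a final newline. A small sketch comparing the two, where the sample string and the bodies() helper are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample input ending in a newline, as most files do.
my $content = "<h1>One</h1>\nfirst body\n<h1>Two</h1>\nlast body\n";

# Collect the body ($2) of every section matched by the given pattern.
sub bodies {
    my ($re) = @_;
    my @bodies;
    while ($content =~ /$re/g) {
        push @bodies, $2;
    }
    return @bodies;
}

my @with_dollar = bodies(qr#<h1>(.+?)</h1>(.+?)(?=<h1>|$)#s);
my @with_z      = bodies(qr#<h1>(.+?)</h1>(.+?)(?=<h1>|\z)#s);

# Only the last body differs: "\z" keeps the trailing newline, "$" drops it.
print "last body with \$:  ", length($with_dollar[-1]), " characters\n";
print "last body with \\z: ", length($with_z[-1]), " characters\n";
```
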
John
--
use Perl;
program
fulfillment
| |
| Jan Eden 2005-03-24, 8:56 am |
| Offer Kaye wrote on 23.03.2005:
>Change your RE to: m#<h1>(.+?)</h1>(.+?)(?=<h1>|$)#gs
>
>In other words, look ahead to either a <h1> or the end of the string
>("$"). I have to admit this problem wasn't as simple as I initially
>thought - I still have no idea why my first guess didn't work:
>m#<h1>(.+?)</h1>(.+?)(?=<h1>)?#gs
>
>Maybe someone with more knowledge of REs can answer?
John W. Krahn wrote on 23.03.2005:
>This should work (untested)
>
>while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>|\z)#gs) {
Hi,
and thanks. I tried Offer Kaye's first guess, too, and I think I can explain why it does not work.
If you make the lookahead optional, the regex will try to match as few characters as possible for the second parentheses - and since the lookahead is optional, this will be only a single character.
You have to force a positive lookahead assertion to make sure $2 receives everything up to either the next <h1> or the end of the string.
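
A quick way to see this effect, sketched with a made-up two-section input. The regexp warnings are switched off only because Perl may complain about a quantifier on a zero-width assertion:

```perl
use strict;
use warnings;
no warnings 'regexp';    # Perl may warn about quantifying a zero-width assertion

# Hypothetical sample input; not taken from the thread.
my $content = "<h1>One</h1>\nfirst body\n<h1>Two</h1>\nlast body\n";

# Same pattern as the first guess: the lookahead is optional.
my @bodies;
while ($content =~ m#<h1>(.+?)</h1>(.+?)(?=<h1>)?#gs) {
    push @bodies, $2;
}

# The lazy (.+?) stops after one character because nothing forces it to go on.
print "body length: ", length($_), "\n" for @bodies;
```

Both headings are still found, but each body is just the newline that follows the closing </h1>.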
So the other suggestion works. Thank you! The reason I had not tried that was the wrong assumption that alternations in lookahead/lookbehind assertions had to be of the same length, like in (?=abc|def), but not (?=abc|defg). But now I remember that the whole lookahead/lookbehind has to be of a fixed length, so you cannot use quantifiers.
Thanks again,
Jan
--
A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.
| |
| John W. Krahn 2005-03-24, 8:56 am |
| Jan Eden wrote:
>
> John W. Krahn wrote on 23.03.2005:
>
>
> and thanks. I tried Offer Kaye's first guess, too, and I think I
> can explain why it does not work.
>
> If you make the lookahead optional, the regex will try to match as
> few characters as possible for the second parentheses - and since
> the lookahead is optional, this will be only a single character.
>
> You have to force a positive lookahead assertion to make sure $2
> receives everything up to either the next <h1> or the end of the
> string.
>
> So the other suggestion works. Thank you! The reason I had not tried
> that was the wrong assumption that alternations in
> lookahead/lookbehind assertions had to be of the same length, like
> in (?=abc|def), but not (?=abc|defg). But now I remember that the
> whole lookahead/lookbehind has to be of a fixed length, so you cannot
> use quantifiers.
Lookahead CAN use quantifiers, but lookbehind CANNOT.
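
A small sketch of the difference, with made-up test strings. The lookbehind pattern is built from a string so that the compile error occurs at run time and can be caught by eval. As far as I know, newer Perl releases accept some bounded variable-length lookbehinds, but an unbounded quantifier such as \d+ is still rejected:

```perl
use strict;
use warnings;

# A quantifier inside a lookahead is fine: "foo" followed by one or more digits.
my @hits = grep { /foo(?=\d+)/ } qw(foo123 foobar);

# An unbounded quantifier inside a lookbehind is rejected when the pattern compiles.
# Interpolating a string defers compilation to run time, so eval can trap the error.
my $pattern  = '(?<=\d+)foo';
my $compiled = eval { qr/$pattern/ };
my $error    = $@;

print "lookahead matched: @hits\n";
print "lookbehind failed to compile\n" unless defined $compiled;
```
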
John
--
use Perl;
program
fulfillment