regular expression help
Jonathan Weber

2006-07-24, 6:57 pm

Hi. I have some HTML files with lines like the following:

<a name="w12234"> </a> <h2>A Title</h2>

I'm using a regular expression to find these and capture the name
attribute ("w12234" in the example) and the contents of the h2 tag ("A
Title").

$_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/

That's my regex, except I'm having trouble with the _____ part. No
matter what I seem to try, it won't match instances where there's a
newline somewhere in the string. I tried all manner of things,
including [.\n], which if I understand correctly should match
*everything*.

I'm doing this on Windows; does the carriage return/line feed business
have anything to do with this?

Thanks in advance.

Paul Lalli

2006-07-24, 6:57 pm

Jonathan Weber wrote:
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match instances where there's a
> newline somewhere in the string.


Have you actually examined each individual string to verify whether or
not there *is* a newline "in" the string? I'm guessing not. I'm
guessing you're processing this HTML line-by-line, meaning that you
have one string that looks like:
<a name="w12234"> </a> <h2>A
and then the next iteration's string looks like:
Title</h2>

Obviously, neither of those strings is going to match your pattern.

You have two basic options (and countless more difficult ones):
1) read the entire file into one big scalar, and do a single
progressive pattern match on that scalar
2) much preferred - stop trying to use regular expressions to parse
HTML. Use an HTML parser, like, for example, HTML::Parser (though, I
recommend HTML::TokeParser for a slightly easier interface)
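
A minimal sketch of option 1, set against the line-by-line approach. The
sample HTML, the in-memory filehandles, and the variable names are all
made up for illustration; a real script would open the actual file
instead.

```perl
use strict;
use warnings;

# Hypothetical sample input: the first heading is split across two lines.
my $html = <<'HTML';
<a name="w12234"> </a> <h2>A
Title</h2>
<a name="w99"> </a> <h2>Another Title</h2>
HTML

my $re = qr{<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>};

# Line by line: each string holds a single line, so the split heading
# can never match.
open my $by_line, '<', \$html or die "Cannot open in-memory file: $!";
my @line_hits;
while (my $line = <$by_line>) {
    push @line_hits, $1 if $line =~ $re;
}
close $by_line;

# Slurp: with $/ undefined, one read returns the whole file.
open my $whole, '<', \$html or die "Cannot open in-memory file: $!";
my $content = do { local $/; <$whole> };
close $whole;

# Progressive match: /g in a while loop resumes after the previous match.
my @slurp_hits;
while ($content =~ /$re/g) {
    push @slurp_hits, $1;
}

print "line by line: @line_hits\n";
print "slurped:      @slurp_hits\n";
```

Only the slurped version sees the heading that spans two lines; the
line-by-line loop finds just the one that fits on a single line.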

> I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.


It doesn't (inside a character class the dot is just a literal dot), and
in any case no pattern can match what's not there.

> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?


No, your logic error has everything to do with it.

Paul Lalli

P.S. Of course, since you didn't bother to show a short-but-complete
script that demonstrates your error, and instead decided to show us
only what *you* think is the cause of the error (and really, if you
knew the cause of the error, would you be posting in the first place?),
everything I said above is a complete guess.

Rob Dixon

2006-07-24, 6:57 pm

Jonathan Weber wrote:
>
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match instances where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
>
> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?


Hi Jonathan.

Some points:

- The character wildcard '.' is just a dot within a character class, so [.\n]
will match only a dot or a newline

- The /s modifier will force '.' to match absolutely anything, including a
newline. So you could write:

$_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

but that isn't what you want, as /.+/ is greedy and will eat up the rest of the
string up to the last </h2> it finds. You could get away with /.+?/, but nicer
is /[^<]+/, which matches one or more characters other than an open angle
bracket (newlines included, so it needs no /s)

- If you're matching against $_ then you can omit the '$_ =~' altogether:

/<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

does the same thing

- A regex enclosed in slashes is shorthand for the m// operator (i.e. /regex/
is the same as m/regex/). Writing the m explicitly lets you use whatever
delimiters you want, so you don't have to escape the contained slashes and the
pattern is more readable:

m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#;

- Regexes aren't the best way of parsing HTML, unless the document is very
simple and predictable. Take a look at something like HTML::TreeBuilder if
you're doing this a lot on varying or non-trivial documents.

- This program does what you want:

use strict;
use warnings;

my $string = <<HTML;
<a name="w12234"> </a> <h2>A
Title</h2>
HTML

# Only use $1 and $2 if the match succeeded
if ($string =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#) {
    print $1, "\n";
    print $2, "\n";
}

OUTPUT

w12234
A
Title
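
The points above about [.\n], greedy /.+/, /.+?/ and /[^<]+/ can be checked
side by side with a rough sketch like the one below. The two-heading test
string and the show() helper are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical input: two headings, the first split across a line break.
my $html = "<h2>A\nTitle</h2>\n<h2>Second</h2>\n";

# Inside [...] the dot is literal, so [.\n]+ cannot match the letters.
my $class_matches = $html =~ m#<h2>([.\n]+)</h2># ? 1 : 0;

# Greedy .+ with /s runs on to the LAST </h2> in the string.
my ($greedy) = $html =~ m#<h2>(.+)</h2>#s;

# Non-greedy .+? stops at the first </h2>.
my ($lazy) = $html =~ m#<h2>(.+?)</h2>#s;

# [^<]+ needs no /s: a negated class matches newlines too.
my ($negated) = $html =~ m#<h2>([^<]+)</h2>#;

# Helper (illustrative only): make newlines visible when printing.
sub show { (my $s = shift) =~ s/\n/\\n/g; return $s }

print "[.\\n]+ matched: $class_matches\n";
print "greedy:  ", show($greedy),  "\n";
print "lazy:    ", show($lazy),    "\n";
print "negated: ", show($negated), "\n";
```

The greedy capture swallows both headings and the tags between them, while
the non-greedy and negated-class versions both stop at the end of the first
heading.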


I hope this helps.

Rob

Dr.Ruud

2006-07-24, 6:57 pm

"Jonathan Weber" wrote:

> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match instances where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.


Not "[.\n]" (because that contains a literal dot), but "(?:.|\n)".
Or do a "s/\n/ /g" first.

But you don't need all that, see `perldoc perlre` about the s-modifier.

And much better: use a proper HTML parser, see CPAN.
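
For comparison, a small sketch of the three regex-only routes mentioned
here: the "(?:.|\n)" alternation, the s-modifier, and flattening newlines
first. The sample string and variable names are made up.

```perl
use strict;
use warnings;

# Hypothetical input: a heading with a newline inside it.
my $html = "<h2>A\nTitle</h2>";

# Alternation: '.' OR a newline, no modifier needed.
my ($alt) = $html =~ m#<h2>((?:.|\n)+?)</h2>#;

# The /s modifier lets '.' match a newline by itself.
my ($dot_s) = $html =~ m#<h2>(.+?)</h2>#s;

# Or flatten the newlines first; working on a copy keeps $html intact.
(my $flat = $html) =~ s/\n/ /g;
my ($flattened) = $flat =~ m#<h2>(.+?)</h2>#;

print "same capture\n" if $alt eq $dot_s;
print "flattened: $flattened\n";
```

The first two capture the same text, newline included; the third captures
"A Title" on one line, which may or may not be what is wanted.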

--
Affijn, Ruud

"Gewoon is een tijger."


Jonathan Weber

2006-07-24, 9:56 pm

On 24 Jul 2006, at 5:48 PM, Rob Dixon wrote:

> - The character wildcard '.' is just a dot within a character
> class, so [.\n]
> will match only a dot or a newline


Ah, I hadn't realized that characters in [ ] are literals. That
clears up a lot of the problem.

> - Regexes aren't the best way of parsing HTML, unless the document
> is very
> simple and predictable. Take a look at something like
> HTML::TreeBuilder if you're
> doing this a lot on varying or non-trivial documents.


Yeah, I realize that it's not the best way to go about it, but I had
many documents that were all the same, plus I figured this was a good
excuse to learn regexes.

Thanks for the help!

Jonathan