regular expression help
| Jonathan Weber 2006-07-24, 6:57 pm |
| Hi. I have some HTML files with lines like the following:
<a name="w12234"> </a> <h2>A Title</h2>
I'm using a regular expression to find these and capture the name
attribute ("w12234" in the example) and the contents of the h2 tag ("A
Title").
$_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
That's my regex, except I'm having trouble with the _____ part. No
matter what I seem to try, it won't match incidences where there's a
newline somewhere in the string. I tried all manner of things,
including [.\n], which if I understand correctly should match
*everything*.
I'm doing this on Windows; does the carriage return/line feed business
have anything to do with this?
Thanks in advance.
| |
| Paul Lalli 2006-07-24, 6:57 pm |
| Jonathan Weber wrote:
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match incidences where there's a
> newline somewhere in the string.
Have you actually examined each individual string to verify whether or
not there *is* a newline "in" the string? I'm guessing not. I'm
guessing you're processing this HTML line-by-line, meaning that you
have one string that looks like:
<a name="w12234"> </a> <h2>A
and then the next iteration's string looks like:
Title</h2>
Obviously, neither of those strings is going to match your pattern.
You have two basic options (and countless more difficult ones):
1) read the entire file into one big scalar, and do a single
progressive pattern match on that scalar
2) much preferred - stop trying to use regular expressions to parse
HTML. Use an HTML parser, like, for example, HTML::Parser (though, I
recommend HTML::TokeParser for a slightly easier interface)
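Option 1 might look something like the sketch below. It is only an illustration, not code from the original poster: the sample data, the in-memory filehandle standing in for a real file, and the %title_for hash are all assumptions made to keep it self-contained.

```perl
use strict;
use warnings;

# Sample data standing in for a real HTML file (hypothetical content).
my $html = <<'HTML';
<a name="w12234"> </a> <h2>A
Title</h2>
<p>Some text</p>
<a name="w12235"></a>
<h2>Another Title</h2>
HTML

# An in-memory filehandle keeps the sketch self-contained; with a real
# file you would open its path here instead.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

# Slurp: with $/ undefined, <$fh> returns the whole file as one string.
my $content = do { local $/; <$fh> };
close $fh;

# Progressive match: with /g in scalar context, each pass of the loop
# resumes where the previous match ended.
my %title_for;
while ( $content =~ m{<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>}g ) {
    my ( $name, $title ) = ( $1, $2 );
    $title =~ s/\s+/ /g;    # fold any embedded newline into a single space
    $title_for{$name} = $title;
    print "$name: $title\n";
}
```

Because the whole file is in one string, a heading that wraps onto the next line is no longer split across two separate strings.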
> I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
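It doesn't: inside a character class the dot is literal, so [.\n] matches
only a dot or a newline. But even a pattern that did match everything
can't match what's not there.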
> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?
No, your logic error has everything to do with it.
Paul Lalli
P.S. Of course, since you didn't bother to show a short-but-complete
script that demonstrates your error, and instead decided to show us
only what *you* think is the cause of the error (and really, if you
knew the cause of the error, would you be posting in the first place?),
everything I said above is a complete guess.
| |
| Rob Dixon 2006-07-24, 6:57 pm |
| Jonathan Weber wrote:
>
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match incidences where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
>
> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?
Hi Jonathan.
Some points:
- The character wildcard '.' is just a dot within a character class, so [.\n]
will match only a dot or a newline
- The /s modifier will force '.' to match absolutely anything, including a
newline. So you could write:
$_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;
but that isn't what you want as /.+/ will eat up all of the rest of the string
until the last </h2> it finds. You could get away with /.+?/ but nicer is
/[^<]+/, which matches one or more characters other than an open angle
bracket (newlines included, so it needs no /s)
- If you're matching against $_ then you can omit the '$_ =~' altogether:
/<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;
does the same thing
- A regex enclosed in slashes has an implied m// operator (i.e. /regex/ is the
same as m/regex/). Writing the m explicitly lets you use whatever delimiters
you want, so you don't have to escape the contained slashes and can make the
pattern more readable:
m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#;
- Regexes aren't the best way of parsing HTML, unless the document is very
simple and predictable. Take a look at something like HTML::TreeBuilder if you're
doing this a lot on varying or non-trivial documents.
- This program does what you want:
use strict;
use warnings;
my $string = <<HTML;
<a name="w12234"> </a> <h2>A
Title</h2>
HTML
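# Only use $1 and $2 if the match succeeded; otherwise they are stale or undef
if ($string =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#) {
    print $1, "\n";    # the name attribute
    print $2, "\n";    # the heading text, embedded newline included
}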
OUTPUT
w12234
A
Title
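To make the difference between greedy /.+/, non-greedy /.+?/ and /[^<]+/ concrete, here is a small sketch. The sample string with two headings is made up for illustration and is not taken from the original poster's files.

```perl
use strict;
use warnings;

# Hypothetical input: two headings, the first one containing a newline.
my $html = qq{<h2>First\nTitle</h2> <p>text</p> <h2>Second</h2>};

# Greedy: with /s, .+ runs to the LAST </h2> in the string.
my ($greedy) = $html =~ m{<h2>(.+)</h2>}s;

# Non-greedy: .+? stops at the first </h2> it can reach.
my ($lazy) = $html =~ m{<h2>(.+?)</h2>}s;

# Negated class: [^<] already matches newlines, so /s is not needed.
my ($negated) = $html =~ m{<h2>([^<]+)</h2>};

for my $pair ( [ greedy => $greedy ], [ lazy => $lazy ], [ negated => $negated ] ) {
    my ( $label, $text ) = @$pair;
    ( my $shown = $text ) =~ s/\n/\\n/g;    # make the newline visible
    print "$label: $shown\n";
}
```

The greedy capture swallows everything up to the second heading's closing tag, while the other two stop at the end of the first heading.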
I hope this helps.
Rob
| |
| Dr.Ruud 2006-07-24, 6:57 pm |
| "Jonathan Weber" wrote:
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match incidences where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
Not "[.\n]" (because that contains a literal dot), but "(?:.|\n)".
Or do a "s/\n/ /g" first.
But you don't need all that, see `perldoc perlre` about the s-modifier.
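As a rough side-by-side of the three approaches just mentioned, assuming a made-up one-heading string and a non-greedy quantifier (neither is from the original post):

```perl
use strict;
use warnings;

# Hypothetical input: a heading whose text wraps onto a second line.
my $html = qq{<h2>A\nTitle</h2>};

# 1. Alternation: any character except newline, or a newline.
my ($by_alternation) = $html =~ m{<h2>((?:.|\n)+?)</h2>};

# 2. Replace newlines with spaces in a copy first; plain '.' then suffices.
( my $flat = $html ) =~ s/\n/ /g;
my ($after_flatten) = $flat =~ m{<h2>(.+?)</h2>};

# 3. The /s modifier lets '.' match a newline as well.
my ($with_s) = $html =~ m{<h2>(.+?)</h2>}s;

for my $text ( $by_alternation, $after_flatten, $with_s ) {
    ( my $shown = $text ) =~ s/\n/\\n/g;    # make any newline visible
    print "$shown\n";
}
```

Approaches 1 and 3 keep the newline inside the captured text; approach 2 captures a space in its place.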
And much better: use a proper HTML parser, see CPAN.
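For comparison, a parser-based version could look roughly like this. It uses HTML::TokeParser, which was suggested earlier in the thread; the module comes from CPAN rather than core Perl, and the sample data, the %title_for hash, and the assumption that the next <h2> after each anchor belongs to it are all illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module (HTML-Parser distribution), not core Perl

# Hypothetical input matching the shape described in the question.
my $html = <<'HTML';
<a name="w12234"> </a> <h2>A
Title</h2>
HTML

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my %title_for;
while ( my $anchor = $parser->get_tag('a') ) {
    # For a start tag, element 1 of the returned array is the attribute hash.
    my $name = $anchor->[1]{name};
    next unless defined $name && $name =~ /\Aw\d+\z/;

    # Assumption: the next <h2> after the anchor is its heading.
    $parser->get_tag('h2') or last;

    # get_trimmed_text collapses runs of whitespace, newlines included.
    $title_for{$name} = $parser->get_trimmed_text('/h2');
    print "$name: $title_for{$name}\n";
}
```

The parser does not care where the line breaks fall, so no special handling of newlines is needed.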
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Jonathan Weber 2006-07-24, 9:56 pm |
| On 24 Jul 2006, at 5:48 PM, Rob Dixon wrote:
> - The character wildcard '.' is just a dot within a character class,
> so [.\n] will match only a dot or a newline
Ah, I hadn't realized that characters in [ ] are literals. That
clears up a lot of the problem.
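A quick way to check this is to test a few single characters against the class. The sketch below is an illustration with made-up sample characters; it also shows that not everything inside [ ] is literal, since escapes such as \d keep their meaning there.

```perl
use strict;
use warnings;

my %in_dot_class;      # does [.\n] match the character?
my %in_digit_class;    # does [\d-] match the character?

for my $char ( '.', "\n", 'a', '7', '-' ) {
    # \A and \z anchor the match to the whole one-character string.
    $in_dot_class{$char}   = $char =~ /\A[.\n]\z/ ? 1 : 0;
    $in_digit_class{$char} = $char =~ /\A[\d-]\z/ ? 1 : 0;

    ( my $shown = $char ) =~ s/\n/\\n/;    # make the newline visible
    printf "%-3s [.\\n]: %d   [\\d-]: %d\n",
        $shown, $in_dot_class{$char}, $in_digit_class{$char};
}
```

Only the dot and the newline match [.\n]; the letter and the digit do not, while \d inside the second class still matches any digit.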
> - Regexes aren't the best way of parsing HTML, unless the document
> is very simple and predictable. Take a look at something like
> HTML::TreeBuilder if you're doing this a lot on varying or
> non-trivial documents.
Yeah, I realize that it's not the best way to go about it, but I had
many documents that were all the same, plus I figured this was a good
excuse to learn regexes.
Thanks for the help!
Jonathan