HTML regex challenge
Max Metral

2004-07-28, 9:00 pm

I'm matching some ASP.net code with some perl regex's to do localization.
I'm having some trouble with asp's embedded use of <% %> and differentiating
it from the html tag... So, the thing I'm matching is like:

<tag a=b c="d">stuff</tag>

My reg ex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

As expected, it stops after %>. Question is, how can I modify the
expression to still get the whole "attribute section" in that single
match... I've tried various back reference constructs, but they don't seem
to do it. The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"...

Hrmph,
--Max
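
A minimal sketch of the fragment asked for above ("everything except a right bracket, unless a % comes right before it"), written with a lookbehind. The sub name split_tag and the sample strings are illustrative assumptions, not something from the post. It only covers the two examples shown; a bare > inside the ASP block would still end the match early.

```perl
use strict;
use warnings;

# Attribute section: any character that is not '>', or a '>' that
# directly follows '%' (the end of an ASP <% ... %> block).
my $tag_re = qr{<tag((?:[^>]|(?<=%)>)*)>(.*?)</tag>}s;

# Hypothetical helper: returns (attributes, content), or an empty list
# when the pattern does not match.
sub split_tag {
    my ($html) = @_;
    return $html =~ $tag_re;
}

for my $html (
    q{<tag a=b c="d">stuff</tag>},
    q{<tag a=b c="<%foo%>">stuff</tag>},
) {
    my ($attrs, $body) = split_tag($html);
    print "attrs=[$attrs] body=[$body]\n";
}
# prints:
# attrs=[ a=b c="d"] body=[stuff]
# attrs=[ a=b c="<%foo%>"] body=[stuff]
```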


Bob Walton

2004-07-28, 9:00 pm

Max Metral wrote:

> I'm matching some ASP.net code with some perl regex's to do localization.
> I'm having some trouble with asp's embedded use of <% %> and differentiating
> it from the html tag... So, the thing I'm matching is like:
>
> <tag a=b c="d">stuff</tag>
>
> My reg ex is:
>
> <tag([^>]*?)>(.*?)</tag>
>
> Which works fine for the first example. But it doesn't for this:
>
> <tag a=b c="<%foo%>">stuff</tag>
>
> As expected, it stops after %>. Question is, how can I modify the
> expression to still get the whole "attribute section" in that single
> match... I've tried various back reference constructs, but they don't seem
> to do it. The expression fragment I want is "match everything except right
> bracket, unless there was a % before the right bracket"...

....


> --Max


Well, there's really only one way to do it right: Parse the HTML.
There are *bunches* of other cases that can bite you besides the one you
found, and, in general, it is most difficult to handle them all,
particularly in a single regexp. Actually, it is probably difficult to
even know about them all. See:

perldoc HTML::Parser
perldoc -q HTML

The latter document has a few of the possible trip-ups listed.
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Tad McClellan

2004-07-28, 9:00 pm

Max Metral <memetral@hotmail.com> wrote:

> Subject: HTML regex challenge



Parsing arbitrary HTML with a regex is nearly impossible.

You need a Real Parser that knows the HTML grammar.


> The expression fragment I want is "match everything except right
> bracket, unless there was a % before the right bracket"...



Your problem description will not do the Right Thing for this HTML:

<img src=".jpg" alt=">>Cool pic!<<">

after you fix the regex for that case, post it here and we
will show some other HTML that breaks it.

Then after you fix the regex for _that_ case, post the regex
and we'll do it again.

Lather, rinse, repeat.

We can keep that up longer than you can. :-)


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

Max Metral

2004-07-28, 9:00 pm

Understood. To argue my case only slightly more, I'm not parsing arbitrary
html, I'm looking for a single tag called "localize" which I then replace the
contents of with the contents of an XML entry from a resource file. So
there's never a case where > appears in an attribute of that tag, UNLESS
it's inside an ASP block (<% %> ). The attributes of the localize tag are
very restricted, true/false type things, except for the fact that somebody
may need to "bind" one of these true/falses to a function call.

So my latest is:
<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>

which fixes my original problem, but it's true that that won't handle

<localize visible="<%# x > 5%>">foo</localize>

but that seems fixable and "final", in that that's the only case that could
occur given the allowable values of the tag...

The problem with most HTML parsers is that (shocker) they don't handle
ASP.Net (which isn't HTML)... So rather than modding something big I was
hoping to keep it simple, even if that means constraining the user of the
tag somewhat.
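
One way the x > 5 case might be covered, sketched under the constraint stated above (a > can only appear in an attribute when it sits inside <% %>): treat each ASP block as a single unit in the attribute section. The sub name parse_localize is a hypothetical helper, and the sketch assumes the ASP block itself never contains the sequence %> (for example inside a string literal).

```perl
use strict;
use warnings;

# Attribute section: a run of either whole ASP blocks (which may contain
# '>') or single characters that are not angle brackets.
my $localize_re = qr{
    <localize
    ( (?: <% .*? %> | [^<>] )* )   # capture 1: attributes
    >
    ( .*? )                        # capture 2: element content
    </localize>
}xs;

# Hypothetical helper: returns (attributes, content), or an empty list
# when the pattern does not match.
sub parse_localize {
    my ($html) = @_;
    return $html =~ $localize_re;
}

my ($attrs, $body) =
    parse_localize(q{<localize visible="<%# x > 5%>">foo</localize>});
print "attrs=[$attrs] body=[$body]\n";
# prints: attrs=[ visible="<%# x > 5%>"] body=[foo]
```

Excluding < as well as > from the single-character branch means a <% can only be consumed as the start of a complete ASP block, so the engine cannot backtrack into the middle of one.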

"Tad McClellan" <tadmc@augustmail.com> wrote in message
news:slrncg5j0d.7h7.tadmc@magna.augustmail.com...
> Max Metral <memetral@hotmail.com> wrote:
>
>
>
> Parsing arbitrary HTML with a regex is nearly impossible.
>
> You need a Real Parser that knows the HTML grammar.
>
>
>
>
> Your problem description will not do the Right Thing for this HTML:
>
> <img src=".jpg" alt=">>Cool pic!<<">
>
> after you fix the regex for that case, post it here and we
> will show some other HTML that breaks it.
>
> Then after you fix the regex for _that_ case, post the regex
> and we'll do it again.
>
> Lather, rinse, repeat.
>
> We can keep that up longer than you can. :-)
>
>
> --
> Tad McClellan SGML consulting
> tadmc@augustmail.com Perl programming
> Fort Worth, Texas



ko

2004-07-28, 9:00 pm

Max Metral wrote:
> Understood. To argue my case only slightly more, I'm not parsing arbitrary
> html, I'm looking for a single tag called "localize" which I then replace the
> contents of with the contents of an XML entry from a resource file. So
> there's never a case where > appears in an attribute of that tag, UNLESS
> it's inside an ASP block (<% %> ). The attributes of the localize tag are
> very restricted, true/false type things, except for the fact that somebody
> may need to "bind" one of these true/falses to a function call.
>
> So my latest is:
> <localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>
>
> which fixes my original problem, but it's true that that won't handle
>
> <localize visible="<%# x > 5%>">foo</localize>
>
> but that seems fixable and "final", in that that's the only case that could
> occur given the allowable values of the tag...


Generally, you can't argue against the advice to use an HTML module to
parse HTML.

If you *really* want to use a regex, (now that you have elaborated
you're looking for a single, *specific* instance) there are modules on
CPAN that make this job easier, one being Regexp::Common:

use strict;
use warnings;
use Regexp::Common qw /balanced/;

my $text = q[<localize visible="<%# x > 5%>">foo</localize>];
(my $changed = $text) =~
s/$RE{balanced}{-begin => '<%'}{-end => '%>'}{-keep}
/changed text/x;
print $changed . "\n";

Also, when replying to someone please keep the content you quote at the
top and your reply on the bottom. Your reply to Tad is an example of
top-posting, which is covered in the group's posting guidelines
available here:

http://mail.augustmail.com/~tadmc/clpmisc.shtml

HTH - keith

Tore Aursand

2004-07-28, 9:00 pm

On Sat, 24 Jul 2004 13:21:11 -0400, Max Metral wrote:
> My reg ex is:
>
> <tag([^>]*?)>(.*?)</tag>
>
> Which works fine for the first example. But it doesn't for this:
>
> <tag a=b c="<%foo%>">stuff</tag>


Hint: Think right-left, not left-right.


--
Tore Aursand <tore@aursand.no>
"Life is pleasant. Death is peaceful. It's the transition that's
troublesome." (Isaac Asimov)
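
One possible reading of the "right-left" hint, offered as an assumption and not necessarily what the poster had in mind: instead of describing which characters may appear in the attribute section, pin the match down from the closing end. The attribute part grows lazily, and a > is only accepted as the end of the opening tag if bracket-free content and </tag> follow it. The sub name split_from_right is hypothetical, and the sketch assumes the element content never contains < or >.

```perl
use strict;
use warnings;

# Let the attribute part grow lazily, but only accept a closing '>' that
# is followed by bracket-free content and then </tag>.
my $tag_re = qr{<tag(.*?)>([^<>]*)</tag>}s;

# Hypothetical helper: returns (attributes, content), or an empty list
# when the pattern does not match.
sub split_from_right {
    my ($html) = @_;
    return $html =~ $tag_re;
}

my ($attrs, $body) = split_from_right(q{<tag a=b c="<%foo%>">stuff</tag>});
print "attrs=[$attrs] body=[$body]\n";
# prints: attrs=[ a=b c="<%foo%>"] body=[stuff]
```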