Home > Archive > PERL Beginners > July 2004 > deleting HTML tag...but not everyone
deleting HTML tag...but not everyone
| Francesco Del Vecchio 2004-07-29, 3:56 pm |
Hi guys,
I have a problem with a Regular expression.
I have to delete from a text all HTML tags but not the DIV one (keeping all the parameters in the
tag).
I've done this:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!/usr/bin/perl
use strict;
my $test=<<EOS;
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS
$test=~s/<br>/\n/ig;
$test=~s/<^[DIV](.*?)>//ig;
print $test;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With this I can have ALMOST what I want.
I delete all HTML tags but the <DIV> one, but I also keep a <D> tag and I delete the </DIV> tag that I
would like to keep.
The problem is in the ^[DIV] part of my regex: the "DIV" string is used as a list of chars and not
as a whole word. Is there a way to achieve my goal?
tnx in advance
Francesco
| Jenda Krynicky 2004-07-29, 3:56 pm |
From: Francesco del Vecchio <f_delvecchio@yahoo.com>
> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).
Don't do that!
You should use an HTML parser module instead of regexps. Parsing HTML
is not as trivial as it may seem.
You may like HTML::JFilter (based on HTML::Parser):
use HTML::JFilter;
$filter = new HTML::JFilter <<'*END*';
div: section style
*END*
$filteredHTML = $filter->doSTRING($enteredHTML);
# http://jenda.krynicky.cz/#HTML::JFilter
> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;
> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;
> $test=~s/<^[DIV](.*?)>//ig;
> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> With this I can have ALMOST what I want. I delete all HTML tags but
> the <DIV> one, but I also keep a <D> tag and I delete the </DIV> tag
> that I would like to keep.
>
> The problem is in the ^[DIV] part of my regex: the "DIV" string is
> used as a list of chars and not as a whole word. Is there a way to
> achieve my goal?
Drop the []. [] means a set of single chars, not a word.
Also, the ^ negates only when it is the first character inside [];
anywhere else in a regexp it anchors to the beginning of the string.
In this case you would have to use a negative look-ahead, (?!...).
Read
perldoc perlretut
perldoc perlre
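
To make the difference concrete, here is a small sketch comparing a negated character class with a negative look-ahead. The tag list and the sub names class_match and lookahead_match are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# [^DIV] rejects any ONE of the characters D, I or V right after the "<".
sub class_match     { return $_[0] =~ /<[^DIV]/i  ? 1 : 0 }

# (?!DIV) rejects the "<" only when the whole word DIV follows it.
sub lookahead_match { return $_[0] =~ /<(?!DIV)/i ? 1 : 0 }

# Hypothetical tags, chosen only to show where the two patterns disagree.
for my $tag ('<D>', '<DIV>', '<VIDEO>', '<font>') {
    printf "%-8s class=%d look-ahead=%d\n",
        $tag, class_match($tag), lookahead_match($tag);
}
```

The two patterns disagree on <D> and <VIDEO>: the character class refuses them because of their first letter, while the look-ahead only refuses a tag that really starts with DIV.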
Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery
| James Edward Gray II 2004-07-29, 3:56 pm |
On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:
> Hi guys,
Hello.
> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).
This is a complex problem. Your solution is pretty naive and will only
work on a tight set of HTML, formatted as you expect it to be.
I'm not saying that's a problem. If you know your HTML will stay
simple, it isn't.
However, if you need or even think you may someday need a more robust
approach, you should check out the HTML parsing modules on the CPAN.
> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;
I would add:
use warnings;
This doesn't do anything for you here, but it's a good habit to build.
It often makes finding errors much easier.
> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;
A little less naive might be:
$test =~ s/<\s*br\s*>/\n/ig;
Even that wouldn't catch the now common <br /> though. Again, use a
module if this kind of thing is important.
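
If the XHTML-style <br /> does need handling, one possible extension is to allow an optional slash before the closing >. This is only a sketch: the sub name br_to_newline and the sample string are invented for the example, and the pattern is still a heuristic rather than a parser.

```perl
use strict;
use warnings;

# Replace <br>, <BR>, < br >, <br/> and <br /> with a newline.
# Tags carrying attributes, such as <br clear="all">, are NOT matched.
sub br_to_newline {
    my ($html) = @_;
    $html =~ s/<\s*br\s*\/?\s*>/\n/ig;
    return $html;
}

# Invented sample covering the spellings listed above.
my $sample = "one<br>two<BR />three< br >four<br/>five";
print br_to_newline($sample), "\n";
```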
> $test=~s/<^[DIV](.*?)>//ig;
This is currently removing zero tags. You are asking for a <, followed
by the beginning of the string (^). That is impossible, and thus never
matches. I believe you meant [^DIV]+, which means one or more non D,
I, or V characters, but that won't work either for reasons you pointed
out.
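
A short sketch of both behaviours, run on a trimmed-down string in the spirit of the original sample. The sub names strip_original and strip_negated are hypothetical, and the second pattern is only a guess at what was intended.

```perl
use strict;
use warnings;

# The pattern as posted: "<" followed by a start-of-string anchor
# can never match, so nothing is removed.
sub strip_original {
    my ($html) = @_;
    $html =~ s/<^[DIV](.*?)>//ig;
    return $html;
}

# Probable intent: a negated character class right after the "<".
# It skips any tag whose first character is D, I or V (either case).
sub strip_negated {
    my ($html) = @_;
    $html =~ s/<[^DIV](.*?)>//ig;
    return $html;
}

my $html = '<font size=2>a<D>b<dIV section=true>c</DIV></font>';
print "original: ", strip_original($html), "\n";
print "negated:  ", strip_negated($html),  "\n";
```

The negated class keeps <D> because it starts with a D, and removes </DIV> because its first character is a slash, which is the behaviour described in the question.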
Here's a simple fix:
$test =~ s/<(?!\/?DIV)[^>]+>//ig;
That searches for a <, then uses a negative look-ahead assertion to
verify that a DIV or /DIV is not next, and finally grabs everything up
to the next >. It works on the example you provided.
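
Put together as a runnable script, the fix might look like the sketch below. The sub name strip_tags_except_div is hypothetical, and the \b after DIV is an addition to the pattern from this post: it is assumed here that tags merely starting with "div", such as a made-up <divider>, should not be kept.

```perl
use strict;
use warnings;

# Turn <br> into newlines, then remove every tag except <DIV ...> and </DIV>.
sub strip_tags_except_div {
    my ($html) = @_;
    $html =~ s/<\s*br\s*>/\n/ig;
    # Negative look-ahead: skip the tag when DIV or /DIV follows the "<".
    # \b (an addition, see above) requires the name to end right after DIV.
    $html =~ s/<(?!\/?DIV\b)[^>]+>//ig;
    return $html;
}

# Part of the sample from the original post.
my $test = <<'EOS';
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS

print strip_tags_except_div($test);
```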
I know I sound like a broken record, but I must again stress how weak
this is. If the HTML contains a < DIV> (note the space), it won't work
properly. Again, parsing HTML is painful; use a module and benefit
from the suffering of others if you need an intelligent solution.
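
For comparison, here is a minimal sketch of the module route using HTML::Parser from CPAN (it is not a core module, so it is assumed to be installed). The sub name keep_only_div is hypothetical, and mapping <br> to a newline is carried over from the regex version rather than anything the module does by itself.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Keep plain text and DIV tags (with their attributes), turn <br> into a
# newline, and drop every other tag.
sub keep_only_div {
    my ($html) = @_;
    my $out = '';

    # Shared handler for start and end tags.
    my $tag_handler = sub {
        my ($tagname, $text) = @_;    # tagname arrives lower-cased
        if    ($tagname eq 'div') { $out .= $text }   # original tag source
        elsif ($tagname eq 'br')  { $out .= "\n"  }
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'tagname, text' ],
        end_h       => [ $tag_handler, 'tagname, text' ],
        text_h      => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;                     # flush any buffered text
    return $out;
}

print keep_only_div('<font size=2>a<br>b<dIV section=true>c</DIV></font>'), "\n";
```

Comments and declarations are dropped here simply because no handler is registered for them; whether that is desirable depends on the input.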
> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hope that helps.
James
P.S. You can use whitespace (blank lines and spaces) to pretty up
your code a little. Your eyes will thank you. Don't worry, it's free!
;)