Code Comments

deleting HTML tag...but not everyone
Hi guys,

I have a problem with a Regular expression.
I have to delete from a text all HTML tags but not the DIV one
(keeping all the parameters in the tag).

I've done this:

 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!/usr/bin/perl
use strict;
my $test=<<EOS;
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS
$test=~s/<br>/\n/ig;
$test=~s/<^[DIV](.*?)>//ig;
print $test;
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With this I get ALMOST what I want.
I delete all HTML tags except the <DIV> one, but I also keep a <D> tag and I delete
the </DIV> tag that I would like to keep.

The problem is in the ^[DIV] part of my regex: the "DIV" string is treated as a list
of chars and not as a whole word. Is there a way to achieve my goal?

Thanks in advance
Francesco



Francesco Del Vecchio
07-29-04 08:56 PM


Re: deleting HTML tag...but not everyone
From: Francesco del Vecchio <f_delvecchio@yahoo.com>
> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).

Don't do that!

You should use a HTML parser module instead of regexps. Parsing HTML
is not as trivial as it may seem.


You may like HTML::JFilter (based on HTML::Parser):

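use HTML::JFilter;
# The here-doc is the filter specification; note the semicolon that ends the statement.
my $filter = HTML::JFilter->new(<<'*END*');
div: section style
*END*
# $enteredHTML holds the HTML text you want to filter.
my $filteredHTML = $filter->doSTRING($enteredHTML);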

# http://jenda.krynicky.cz/#HTML::JFilter

> I've done this:
>
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;
> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;
> $test=~s/<^[DIV](.*?)>//ig;
> print $test;
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> With this I can have ALMOST what I want. I delete all HTML tags but
> <DIV> one but I also keep a <D> tag and I delete the </DIV> tag that I
> would like to keep
>
> The problem is in the ^[DIV] part of my regex....the "DIV" string is
> used as list of chars and not as whole word. Is there a way to
> achieve my goal?

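Drop the []. [] is a character class: it matches any single one of
the listed chars.

Also, the ^ only has a special meaning at the beginning of a regexp
(where it anchors the match to the start of the string) or at the
beginning of a character class (where it negates the class).
In this case you would have to use a negative look-ahead.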

Read
perldoc perlretut
perldoc perlre
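
To make the difference concrete, here is a small sketch comparing a character class, a negated class, the literal word and a negative look-ahead. The classify() helper and the sample tags are made up for this illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report which of four patterns match right after the "<" of a tag.
# classify() is a hypothetical helper written for this illustration.
sub classify {
    my ($tag) = @_;
    my %result = (
        # [DIV]: character class, ONE character that is D, I or V
        class     => ($tag =~ /<[DIV]/i     ? 1 : 0),
        # [^DIV]: negated class, ONE character that is not D, I or V
        negated   => ($tag =~ /<[^DIV]/i    ? 1 : 0),
        # DIV: the literal three-letter sequence (\b ends the word)
        literal   => ($tag =~ /<DIV\b/i     ? 1 : 0),
        # (?!DIV\b): negative look-ahead, true when "DIV" does NOT follow
        lookahead => ($tag =~ /<(?!DIV\b)/i ? 1 : 0),
    );
    return \%result;
}

for my $tag ('<DIV>', '<D>', '</DIV>', '<video>') {
    my $r = classify($tag);
    printf "%-8s class=%d negated=%d literal=%d lookahead=%d\n",
        $tag, @{$r}{qw(class negated literal lookahead)};
}
```

Note that the negated class does not match <D> (the D is one of the excluded chars) but does match </DIV> (the slash is not), which is consistent with the behaviour described in the question.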

Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery


Jenda Krynicky
07-29-04 08:56 PM


Re: deleting HTML tag...but not everyone
On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:

> Hi guys,

Hello.

> I have a problem with a Regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).

This is a complex problem.  Your solution is pretty naive and will only
work on a tight set of HTML, formatted as you expect it to be.

I'm not saying that's a problem.  If you know your HTML will stay
simple, it isn't.

However, if you need or even think you may someday need a more robust
approach, you should check out the HTML parsing modules on the CPAN.

> I've done this:
>
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;

I would add:

use warnings;

This doesn't do anything for you here, but it's a good habit to build.
It often makes finding errors much easier.
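
As a rough illustration of the kind of error it catches, the sketch below mistypes a hash key. The %price data is invented for the example, and the __WARN__ handler is only there to collect the message so it can be printed; normally the warning simply goes to STDERR.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect warnings in an array instead of letting them go to STDERR,
# so the example can show what was reported.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my %price = (apple => 3);    # hypothetical data
my $total = 0;
$total += $price{aple};      # typo in the key: the value is undef

print "total: $total\n";
print "warning: $_" for @warnings;
```

Without warnings enabled, the typo passes silently and the total is just wrong.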

> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;

A little less naive might be:

$test =~ s/<\s*br\s*>/\n/ig;

Even that wouldn't catch the now common <br /> though.  Again, use a
module if this kind of thing is important.
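
If you do stay with a regex, one possible extension is to allow an optional slash before the closing >. This is only a sketch; br_to_newline() is a name made up here, and the pattern still misses a <br> that carries attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace <br>, <BR/>, <br /> and < br > with a newline.
# Does NOT handle attributes such as <br clear="all">.
sub br_to_newline {
    my ($html) = @_;
    $html =~ s/<\s*br\s*\/?\s*>/\n/ig;
    return $html;
}

print br_to_newline("one<br>two<BR/>three<br />four< br >five"), "\n";
```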

> $test=~s/<^[DIV](.*?)>//ig;

This is currently removing zero tags.  You are asking for a <, followed
by the beginning of the string (^).  That is impossible, and thus never
matches.  I believe you meant [^DIV]+, which means one or more non D,
I, or V characters, but that won't work either for reasons you pointed
out.

Here's a simple fix:

$test =~ s/<(?!\/?DIV)[^>]+>//ig;

That searches for a <, then uses a negative look-ahead assertion to
verify that a DIV or /DIV is not next, and finally grabs everything up
to the next >.  It works on the example you provided.
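
Put together with the sample data from the question, the whole script would look something like this (the only changes are the added warnings pragma, some whitespace, a non-interpolating here-doc and the new substitution):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample HTML copied from the original post.
my $test = <<'EOS';
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS

# Turn line breaks into newlines first.
$test =~ s/<br>/\n/ig;

# Remove every tag unless "DIV" or "/DIV" follows the "<".
$test =~ s/<(?!\/?DIV)[^>]+>//ig;

print $test;
```

The <D> tag is removed because the look-ahead needs all three letters, and both <dIV section=true> and </DIV> survive thanks to the /i modifier and the optional slash.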

I know I sound like a broken record, but I must again stress how weak
this is.  If the HTML contains a < DIV> (note the space), it won't work
properly.  Again, parsing HTML is painful, use a module and benefit
from the suffering of others if you need an intelligent solution.
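
For what it's worth, the look-ahead can be loosened a little to tolerate that whitespace, and a \b keeps longer tag names that merely start with "div" from slipping through. This is a sketch, not a substitute for a parser: strip_tags_except_div() is a made-up name, <DIVIDER> is an invented tag, and the pattern still breaks on things like a > inside an attribute value or on comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove all tags except <div ...> and </div>, allowing whitespace
# around the optional slash. \b stops "DIV" from matching "DIVIDER".
sub strip_tags_except_div {
    my ($html) = @_;
    $html =~ s/<(?!\s*\/?\s*DIV\b)[^>]+>//ig;
    return $html;
}

my $in = '<p>< DIV class="x">text</ div ><DIVIDER>more</p>';
print strip_tags_except_div($in), "\n";
```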

> print $test;
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hope that helps.

James

P.S.  You can use whitespace (blank lines and spaces) to pretty up
your code a little.  Your eyes will thank you.  Don't worry, it's free!
;)


James Edward Gray II
07-29-04 08:56 PM


PERL Beginners archive

All times are GMT. The time now is 04:30 PM.

 