Code Comments
Hi guys,

I have a problem with a regular expression.
I have to delete from a text all HTML tags but not the DIV one
(keeping all the parameters in the tag).

I've done this:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!/usr/bin/perl
use strict;
my $test=<<EOS;
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS
$test=~s/<br>/\n/ig;
$test=~s/<^[DIV](.*?)>//ig;
print $test;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With this I get ALMOST what I want. I delete all HTML tags except the
<DIV> one, but I also keep a <D> tag, and I delete the </DIV> tag that I
would like to keep.

The problem is in the ^[DIV] part of my regex: the "DIV" string is
used as a list of chars and not as a whole word. Is there a way to
achieve my goal?

Thanks in advance,
Francesco
From: Francesco del Vecchio <f_delvecchio@yahoo.com>
> I have a problem with a regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).

Don't do that! You should use an HTML parser module instead of regexps.
Parsing HTML is not as trivial as it may seem.

You may like HTML::JFilter (based on HTML::Parser):

	use HTML::JFilter;
	$filter = new HTML::JFilter <<'*END*';
	div: section style
	*END*
	$filteredHTML = $filter->doSTRING($enteredHTML);
	# http://jenda.krynicky.cz/#HTML::JFilter

> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;
> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;
> $test=~s/<^[DIV](.*?)>//ig;
> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> With this I get ALMOST what I want. I delete all HTML tags except the
> <DIV> one, but I also keep a <D> tag, and I delete the </DIV> tag that I
> would like to keep.
>
> The problem is in the ^[DIV] part of my regex: the "DIV" string is
> used as a list of chars and not as a whole word. Is there a way to
> achieve my goal?

Drop the []. [] means a group of chars (a character class), not a word.
Also, the ^ only means "not" when it is the first character inside such
a group; anywhere else it anchors the match to the beginning of the
string. In this case you would have to use a negative look-ahead.

Read
	perldoc perlretut
	perldoc perlre

Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
	-- Terry Pratchett in Sourcery
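To make the difference between a group of chars and a look-ahead concrete, here is a small sketch. It assumes the character-class attempt was meant as <[^DIV].*?> and uses a look-ahead of the form (?!\/?DIV); the sub names class_removes and lookahead_removes are made up for this illustration. It reports, tag by tag, whether each pattern would match (and therefore delete) the tag.

```perl
use strict;
use warnings;

# Character class: only the FIRST character after "<" is tested, and it
# must not be D, I, or V (upper or lower case, because of /i).
sub class_removes {
    my ($tag) = @_;
    return $tag =~ /<[^DIV].*?>/i ? 1 : 0;
}

# Negative look-ahead: "<" must not be followed by "DIV" or "/DIV".
# The look-ahead consumes nothing, so [^>]+ still sees the whole tag body.
sub lookahead_removes {
    my ($tag) = @_;
    return $tag =~ /<(?!\/?DIV)[^>]+>/i ? 1 : 0;
}

for my $tag ('<D>', '<dIV section=true>', '</DIV>', '<font size=2>') {
    printf "%-20s class removes: %d   look-ahead removes: %d\n",
        $tag, class_removes($tag), lookahead_removes($tag);
}
```

The character class keeps <D> (it starts with a D) and deletes </DIV> (it starts with a slash), which are exactly the two symptoms described in the question. The look-ahead version deletes <D> and keeps both the opening and the closing DIV tag.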
On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:

> Hi guys,

Hello.

> I have a problem with a regular expression.
> I have to delete from a text all HTML tags but not the DIV one
> (keeping all the parameters in the tag).

This is a complex problem. Your solution is pretty naive and will only
work on a tight set of HTML, formatted as you expect it to be. I'm not
saying that's a problem. If you know your HTML will stay simple, it
isn't. However, if you need or even think you may someday need a more
robust approach, you should check out the HTML parsing modules on the
CPAN.

> I've done this:
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #!/usr/bin/perl
> use strict;

I would add:

	use warnings;

This doesn't do anything for you here, but it's a good habit to build.
It often makes finding errors much easier.

> my $test=<<EOS;
> <html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
> </head><body><font face="Courier New" size=2>
> =========SUPER SAVING========= <br>
> -product one <br>
> -product two <br><D>
> -product three <br><dIV section=true>
> ============================== <Br></DIV>
> <br><br></font></body> </html>
> EOS
> $test=~s/<br>/\n/ig;

A little less naive might be:

	$test =~ s/<\s*br\s*>/\n/ig;

Even that wouldn't catch the now common <br /> though. Again, use a
module if this kind of thing is important.

> $test=~s/<^[DIV](.*?)>//ig;

This is currently removing zero tags. You are asking for a <, followed
by the beginning of the string (^). That is impossible, and thus never
matches. I believe you meant [^DIV]+, which means one or more non D, I,
or V characters, but that won't work either for reasons you pointed out.

Here's a simple fix:

	$test =~ s/<(?!\/?DIV)[^>]+>//ig;

That searches for a <, then uses a negative look-ahead assertion to
verify that a DIV or /DIV is not next, and finally grabs everything up
to the next >. It works on the example you provided.

I know I sound like a broken record, but I must again stress how weak
this is. If the HTML contains a < DIV> (note the space), it won't work
properly. Again, parsing HTML is painful; use a module and benefit from
the suffering of others if you need an intelligent solution.

> print $test;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hope that helps.

James

P.S. You can use whitespace (blank lines and spaces) to pretty up your
code a little. Your eyes will thank you. Don't worry, it's free! ;)
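For anyone who wants to try the look-ahead fix end to end, here is a minimal sketch run against the sample from the question. It is not the code posted above: the sub name strip_tags_except_div is made up, and the optional whitespace, the optional slash in the br pattern, and the \b after DIV are additions so that <br />, < DIV> and a tag such as <divider> are handled as one would probably expect. It is still a regex, so a > inside an attribute value, a comment, or a script block will confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: turn <br> variants into newlines, then delete every
# tag except opening and closing DIV tags (attributes are left untouched).
sub strip_tags_except_div {
    my ($html) = @_;

    # <br>, <BR>, < br >, <br /> all become a newline.
    $html =~ s/<\s*br\s*\/?\s*>/\n/ig;

    # Negative look-ahead: skip tags whose name is exactly DIV,
    # with or without a leading slash and stray whitespace.
    $html =~ s/<(?!\s*\/?\s*DIV\b)[^>]+>//ig;

    return $html;
}

my $test = <<'EOS';
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS

print strip_tags_except_div($test);
```

On the sample, the <D> tag is removed while <dIV section=true> and </DIV> survive with their original spelling and attributes.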