Author RegEx Help Needed
DeepDiver

2004-12-04, 3:57 am

I'm trying to parse a string of HTML that contains a mix of tags and text.
My goal is to match and replace double quote marks in the text (but not
within the tags) and replace them with the equivalent html character entity
(i.e., &quot;).

For example, this string:
The "slow" red fox.<div class="test">The "quick" brown fox.</div>

would become this:
The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
brown fox.</div>

TIA!!!


Sherm Pendley

2004-12-04, 3:57 am

DeepDiver wrote:

> I'm trying to parse a string of HTML


Have a look at HTML::Parser on CPAN.

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
DeepDiver

2004-12-04, 3:57 am

"Sherm Pendley" <spamtrap@dot-app.org> wrote in message
news:SOydnYD65MRH0izcRVn-tg@adelphia.com...
>
> Have a look at HTML::Parser on CPAN.
>


Thanks, but I'm in need of a pure RegEx solution.


Lars Eighner

2004-12-04, 3:57 am

In our last episode, <JZbsd.9270$_3.108493@typhoon.sonic.net>, the lovely
and talented DeepDiver broadcast on comp.lang.perl.misc:

> I'm trying to parse a string of HTML that contains a mix of tags and text.
> My goal is to match and replace double quote marks in the text (but not
> within the tags) and replace them with the equivalent html character entity
> (i.e., &quot;).


> For example, this string:
> The "slow" red fox.<div class="test">The "quick" brown fox.</div>


> would become this:
> The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
> brown fox.</div>


> TIA!!!


I can't do it in one, but --

WARNING! Those offended by brute force ugliness should look away now!
WARNING!

goodwill~/test$perl -wpi -e '$/=undef;while( s/\"([^<>]*<)/&quot\;$1/g ){}
;' test.html

This won't work if you have unbalanced < and/or > characters anywhere in the
document such as a script with something like document.write("<")
or simply unclosed tags. If you actually run this as a one-liner,
beware of what your shell may do with $1 if you double quote the
executable.
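
For anyone reading along without Perl to hand, here is a rough Python sketch of the same brute-force loop. Python is a choice made for illustration, not something from the one-liner, and the function name escape_quotes_before_tags is made up. It assumes the pattern is meant to replace a double quote only when a run of non-bracket characters leads straight to the next <, repeating until nothing changes.

```python
import re

# A double quote, then any run of non-bracket characters, then "<".
# With balanced brackets, a quote inside a tag is followed by ">" before
# any "<", so only quotes in text can match.
QUOTE_BEFORE_TAG = re.compile(r'"([^<>]*<)')


def escape_quotes_before_tags(html: str) -> str:
    """Hypothetical helper: repeat the substitution until no quote matches."""
    while True:
        html, count = QUOTE_BEFORE_TAG.subn(r"&quot;\1", html)
        if count == 0:
            return html


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_quotes_before_tags(sample))
```

Each pass only fixes the first remaining quote in each run of text, because the match swallows everything up to the <, which is why the loop is needed. Quotes after the last tag have no < following them and are left alone, and a string such as document.write("<") gets mangled, as warned above.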


--
Lars Eighner -finger for g code- eighner@io.com http://www.io.com/~eighner/
War on Terrorism: Camp Follower
"I am ... a total sucker for the guys ... with all the ribbons on and stuff,
and they say it's true and I'm ready to believe it. -Cokie Roberts,_ABC_
David H. Adler

2004-12-04, 3:57 am

On 2004-12-04, DeepDiver <no-spam@sonic.net> wrote:
> "Sherm Pendley" <spamtrap@dot-app.org> wrote in message
> news:SOydnYD65MRH0izcRVn-tg@adelphia.com...
>
> Thanks, but I'm in need of a pure RegEx solution.


This of course raises the question: Why?

We can probably help you better if we have some idea of why you reject
the generally accepted solution...

dha

--
David H. Adler - <dha@panix.com> - http://www.panix.com/~dha/
[Insert Angus Prune Tune here]
DeepDiver

2004-12-07, 4:10 am

"David H. Adler" <dha@panix.com> wrote in message
news:slrncr2pos.j2i.dha@panix2.panix.com...
> On 2004-12-04, DeepDiver <no-spam@sonic.net> wrote:
>
> This of course raises the question: Why?



A few reasons:

1. I'm not programming in Perl. In fact, my experience with Perl was a long
time ago (and not very extensive even then). I came here because I believe
that Perl programmers are generally the most proficient with regular
expressions.

2. I'm writing the current routine in C#. But I would still prefer a "pure"
RegEx solution so that I have something that is concise and (higher-level)
language independent.

3. I'm trying to improve my RegEx skills, so the more I can learn how to do
things like this in RegEx (without "massaging" in a higher-level language)
the better.

I hope this addresses your concerns.

Thanks,
Michael


Sherm Pendley

2004-12-07, 4:10 am

DeepDiver wrote:

> 1. I'm not programming in Perl.
>
> 2. I'm writing the current routine in C#.


This is a Perl group. The C# group is down the hall to the left. Don't
let the door hit you on the way out.

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
Joe Smith

2004-12-07, 4:11 am

DeepDiver wrote:

> 1. I came here because I believe
> that Perl programmers are generally the most proficient with regular
> expressions.


Regular expressions as implemented in other languages are not the same.

Using just a regular expression won't cut it; correct parsing usually
requires program logic as well.
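
As a rough illustration of what that program logic can look like, here is a sketch built on the HTMLParser class from Python's standard library, used here as a stand-in for an event-driven parser such as HTML::Parser. Python is an illustrative choice, the names QuoteEscaper and escape_text_quotes are hypothetical, and the sketch assumes it is acceptable to re-emit end tags in lower case and to ignore anything the handlers below do not cover (CDATA sections, for instance).

```python
from html.parser import HTMLParser


class QuoteEscaper(HTMLParser):
    """Hypothetical sketch: re-emit the document, escaping quotes in text only."""

    RAW_TEXT = {"script", "style"}  # element content that must not be touched

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it was written.
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")  # tag name arrives lower-cased
        if tag in self.RAW_TEXT:
            self.in_raw = False

    def handle_data(self, data):
        # Only plain text outside script/style gets its quotes replaced.
        self.out.append(data if self.in_raw else data.replace('"', "&quot;"))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def escape_text_quotes(html: str) -> str:
    parser = QuoteEscaper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(escape_text_quotes('The "slow" red fox.<div class="test">The "quick" brown fox.</div>'))
```

The parser decides what is a tag and what is text, so the replacement itself shrinks to a plain string replace in handle_data.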
-Joe
Tassilo v. Parseval

2004-12-07, 4:11 am

Also sprach DeepDiver:

> "David H. Adler" <dha@panix.com> wrote in message
> news:slrncr2pos.j2i.dha@panix2.panix.com...
>
>
> A few reasons:
>
> 1. I'm not programming in Perl. In fact, my experience with Perl was a long
> time ago (and not very extensive even then). I came here because I believe
> that Perl programmers are generally the most proficient with regular
> expressions.


This nonetheless makes your posting rather off-topic in this group. Perl
did not invent regular expressions. Also, Perl regular expressions are
likely to be more powerful than regular expressions found in other
languages. This means you probably couldn't use a regex solution
from this group in your program.

> 2. I'm writing the current routine in C#. But I would still prefer a "pure"
> RegEx solution so that I have something that is concise and (higher-level)
> language independent.


I have my doubts as to the conciseness of a pure regex solution.
Classical regular expressions aren't even remotely powerful enough to
parse HTML (and there's not much to argue about: it can be proven with
the famous Pumping lemma). Perl's regular expressions might be powerful
enough as they have some non-regular extensions (they allow
back-references, they can be recursive etc.). Still, a regex solution
could hardly be robust. Let alone the fact that .NET regular expressions
lack many of the Perl features.
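
To make the robustness point concrete, here is a small Python sketch. Python and the name escape_quotes_single_pass are choices made for this illustration; nothing in it is specific to Perl or .NET. It uses a common trick: match either a whole tag or a bare quote, and rewrite only the quote. It assumes a tag never contains a > inside an attribute value, which real HTML does not guarantee.

```python
import re

# Either a complete tag (naively: "<", anything but ">", then ">")
# or a single double quote found outside such a tag.
TAG_OR_QUOTE = re.compile(r'<[^>]*>|"')


def escape_quotes_single_pass(html: str) -> str:
    """Hypothetical helper: copy tags through unchanged, escape bare quotes."""
    return TAG_OR_QUOTE.sub(
        lambda m: "&quot;" if m.group(0) == '"' else m.group(0),
        html,
    )


# Works on the example from the original question.
good = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_quotes_single_pass(good))

# Breaks as soon as an attribute value contains ">": the naive tag pattern
# stops early and the attribute's closing quote is treated as text.
bad = '<a title="1 > 0">say "hi"</a>'
print(escape_quotes_single_pass(bad))
```

The second input comes out with the attribute's closing quote turned into an entity, so the markup is damaged even though the pattern looked fine on the simple case.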

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{re
htonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
Alan J. Flavell

2004-12-07, 4:11 am

On Sat, 4 Dec 2004, Tassilo v. Parseval wrote:

> Perl regular expressions are likely to be more powerful than regular
> expressions found in other languages.


Would this be a moment to mention PCRE, http://www.pcre.org/ ?

"Perl Compatible Regular Expressions" library.

I often use its diagnostic command, "pcretest", to explore the
behaviour of some complex regex that I'm working with, when fed with
various data, whether the regex is meant for Perl or for
writing ACLs for the same author's excellent MTA, exim.

(Of course, that has nothing to do with attempting to use regexes for
parsing arbitrary HTML - which is ultimately hopeless.)
Jürgen Exner

2004-12-07, 4:11 am

DeepDiver wrote:
[About parsing HTML]
> "Sherm Pendley" <spamtrap@dot-app.org> wrote in message
> news:SOydnYD65MRH0izcRVn-tg@adelphia.com...
>
> Thanks, but I'm in need of a pure RegEx solution.


Forget it. Nobody with a sane mind would try parsing HTML using pure REs.
Contrary to popular belief, parsing HTML is non-trivial, and while it is not
decided yet if Perl's advanced REs are powerful enough to do it, most
certainly it would be _way_ too complex to be of any real use.
As this has been discussed many times before, please see the FAQ and Google
for further details.

jue


Chris Mattern

2004-12-07, 4:11 am

DeepDiver wrote:

> "Sherm Pendley" <spamtrap@dot-app.org> wrote in message
> news:SOydnYD65MRH0izcRVn-tg@adelphia.com...
>
> Thanks, but I'm in need of a pure RegEx solution.


No, you aren't. You may think you are, but you aren't.
--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
Chris Mattern

2004-12-07, 4:11 am

DeepDiver wrote:

> "David H. Adler" <dha@panix.com> wrote in message
> news:slrncr2pos.j2i.dha@panix2.panix.com...
>
>
> A few reasons:
>
> 1. I'm not programming in Perl. In fact, my experience with Perl was a
> long time ago (and not very extensive even then). I came here because I
> believe that Perl programmers are generally the most proficient with
> regular expressions.


Regular expressions differ subtly but significantly between the languages
that implement them. Solutions formulated for Perl regular expressions
would have a good chance of not working in your language. Ask in a
forum that deals with your language.
>
> 2. I'm writing the current routine in C#. But I would still prefer a
> "pure" RegEx solution so that I have something that is concise and
> (higher-level) language independent.


See above about the portability of regular expressions.
>
> 3. I'm trying to improve my RegEx skills, so the more I can learn how to
> do things like this in RegEx (without "massaging" in a higher-level
> language) the better.


Regular expressions are a very poor tool for parsing HTML. Depending
on your task, using them to do so will range from hair-tearingly frustrating
to simply impossible. Parsing HTML is not a trivial task. The main
lesson you would learn trying to parse HTML with regular expressions would
be, if you were paying attention, "don't parse HTML with regular
expressions".
>
> I hope this addresses your concerns.


Hope these address yours.
>
> Thanks,
> Michael


--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"

Copyright 2008 codecomments.com