regex nth match of string, perl 5.8.5
Chris Cosner

2006-01-23, 6:55 pm

Hello,

I have some data coming into a system with markup that I need to
reinterpret into HTML. I have a feeling I'm making this problem more
complex than it needs to be.

The beginning and ending markup are the same. So <I>text<I> would
become <I>text</I> in HTML. A single line can have more than one of
these italicized words, for example.

So this doesn't work:
s/(\<I\>)(.*)(\<I\>)/$1$2\<\/I\>/g
because .* matches anything in between and only the last one is
considered. <I>text<I>blahblah<I>text<I> becomes
<I>text<I>blahblah<I>text</I>.

I would like it to match every second <I> and replace it with </I>, no
matter how many occurrences in a line. Am I on the right track with
something like this?:
for $line (@descriptions) {
....
$count = 0;
$line =~
s{
\<I\>
}{
if (++$count==2){
\<\/I\>
}
else {
\<I\>
}
}gex;
....
}
which gives me the following error:
Unterminated <> operator at program.pl line 48.
Line 48 is }{
So I have the syntax wrong, at the very least.

Should I be thinking more in terms of some sort of lookahead?

Any hints as to how to approach this will be appreciated.

-Chris Cosner

Paul Lalli

2006-01-23, 6:55 pm

Chris Cosner wrote:
> I have some data coming into a system with markup that I need to
> reinterpret into HTML. I have a feeling I'm making this problem more
> complex than it needs to be.


You are. :-)

> The beginning and ending markup are the same. So <I>text<I> would
> become <I>text</I> in HTML. A single line can have more than one of
> these italicized words, for example.
>
> So this doesn't work:
> s/(\<I\>)(.*)(\<I\>)/$1$2\<\/I\>/g


< and > are not special in a regular expression. No need to escape
them. Further, there is no reason to make your life difficult by
choosing a delimiter which actually appears in your replacement string,
causing you to escape that as well:

s#(<I>)(.*)(<I>)#$1$2</I>/g;

> because .* matches anything in between and only the last one is
> considered. <I>text<I>blahblah<I>text<I> becomes
> <I>text<I>blahblah<I>text</I>.


Right. So change that behavior. Have you read
perldoc perlretut
yet? That is a good place to start. Search for "non-greedy".
Basically, you can change the behavior of the * quantifier from "as
much as possible" to "as little as possible" by appending a ?, like so:

s#(<I>)(.*?)(<I>)#$1$2</I>/g;
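
To make the difference concrete, here is a minimal sketch that runs the greedy and the non-greedy pattern against the sample line from the question. The variable names are only for illustration, and the patterns are written with s{...}{...} delimiters so that nothing needs escaping.

```perl
use strict;
use warnings;

# Sample input taken from the original question.
my $input = '<I>text<I>blahblah<I>text<I>';

# Greedy: .* runs to the last <I> on the line, so only that one is closed.
(my $greedy = $input) =~ s{(<I>)(.*)(<I>)}{$1$2</I>}g;

# Non-greedy: .*? stops at the nearest following <I>, so each pair is handled.
(my $lazy = $input) =~ s{(<I>)(.*?)(<I>)}{$1$2</I>}g;

print "greedy: $greedy\n";   # greedy: <I>text<I>blahblah<I>text</I>
print "lazy:   $lazy\n";     # lazy:   <I>text</I>blahblah<I>text</I>
```

The copy-then-substitute idiom, (my $copy = $input) =~ s..., is used instead of the /r modifier so that the sketch also runs on perl 5.8.5.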

> I would like it to match every second <I> and replace it with </I>, no
> matter how many occurrences in a line. Am I on the right track with
> something like this?:
> for $line (@descriptions) {
> ...
> $count = 0;
> $line =~
> s{
> \<I\>
> }{
> if (++$count==2){
> \<\/I\>
> }
> else {
> \<I\>
> }
> }gex;
> ...
> }
> which gives me the following error:
> Unterminated <> operator at program.pl line 48.
> Line 48 is }{
> So I have the syntax wrong, at the very least.


Yes. You also have a logic error. The syntax error comes from the fact
that when you use the /e modifier, your replacement "string" becomes
code, and
\<\/I\>
is being interpreted as executable code. What you *want* is a string
containing those characters, rather than the code represented by those
characters. Replace that sequence with:
"</I>"
(including the quotes), and likewise replace \<I\> in the else branch
with "<I>", and the syntax error will disappear.

The logic error is that you are *only* replacing the second <I>, not
the second, fourth, sixth, etc. Change
++$count==2
to
$count++ % 2

This tests whether the current value of $count is odd, and then
increments it. Because $count starts at 0, the test is true for the
second, fourth, sixth, etc. match.
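
Putting both fixes together, a sketch of the counter-based version could look like the following. The helper name close_every_second_i and the sample @descriptions data are made up for illustration, and a ternary is used in place of the if/else block so that the /e code clearly returns a string.

```perl
use strict;
use warnings;

# Hypothetical helper: close every second <I> in one line of text.
sub close_every_second_i {
    my ($line) = @_;
    my $count = 0;    # starts at 0 again for every line

    # /e evaluates the replacement as Perl code, so it must yield a string.
    # $count++ % 2 is 0 (false) for the 1st, 3rd, ... match and 1 (true)
    # for the 2nd, 4th, ... match.
    $line =~ s{<I>}{ $count++ % 2 ? '</I>' : '<I>' }ge;

    return $line;
}

# Assumed sample data; the first entry is the line from the question.
my @descriptions = ('<I>text<I>blahblah<I>text<I>', 'no markup here');
print close_every_second_i($_), "\n" for @descriptions;
```

Unlike the non-greedy pattern, this version does not care what lies between the two markers. If a line has an odd number of <I> markers, the last one is simply left as an opening tag.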

However, as explained above, none of this large block is needed. Just
take your existing regexp, and make the quantifier non-greedy.

> Should I be thinking more in terms of some sort of lookahead?


No, that would be even *more* complex than it needs to be... ;-)

Paul Lalli

Paul Lalli

2006-01-23, 6:55 pm


Paul Lalli wrote:
> Chris Cosner wrote:
>
> You are. :-)
>
>
> < and > are not special in a regular expression. No need to escape
> them. Further, there is no reason to make your life difficult by
> choosing a delimiter which actually appears in your replacement string,
> causing you to escape that as well:
>
> s#(<I>)(.*)(<I>)#$1$2</I>/g;


Oy. Did I really type that?

s#(<I>)(.*)(<I>)#$1$2</I>#g;

is what I meant of course, and then later...

> Basically, you can change the behavior of the * quantifier from "as
> much as possible" to "as little as possible" by appending a ?, like so:
>
> s#(<I>)(.*?)(<I>)#$1$2</I>/g;


s#(<I>)(.*?)(<I>)#$1$2</I>#g;

Sorry about that.

Paul Lalli

Dr.Ruud

2006-01-23, 6:55 pm

Chris Cosner wrote:

> The beginning and ending markup are the same. So <I>text<I> would
> become <I>text</I> in HTML. A single line can have more than one of
> these italicized words, for example.


If each opening and closing <I> pair sits on a single line:

s~(<I>.*?)<I>~$1</I>~g
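
The single-line condition matters because . does not match a newline by default. As a rough sketch, assuming the whole text has been read into one string (the $text value below is invented for illustration), adding the /s modifier lets the same pattern handle a pair that spans a line break:

```perl
use strict;
use warnings;

# Assumed input: one string in which an italic span crosses a line break.
my $text = "<I>first\nsecond<I> and <I>third<I>\n";

# Without /s, . stops at the newline, so the first <I> is skipped and the
# remaining markers get paired up wrongly.
(my $single = $text) =~ s{(<I>.*?)<I>}{$1</I>}g;

# With /s, . also matches newlines, so every pair is closed correctly.
(my $multi = $text) =~ s{(<I>.*?)<I>}{$1</I>}gs;

print "without /s:\n$single";
print "with /s:\n$multi";
```

When the data is processed one line at a time and no pair ever crosses a line break, the plain /g version is sufficient.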

--
Grtz, Ruud

Chris Cosner

2006-01-24, 3:55 am

Bingo! Thanks for the replies.

>
> Just use the "non-greedy" form of "*":
>
> s{(\<I\>)(.*?)(\<I\>)}{$1$2\<\/I\>}g
>
> should do what you want.
>

