Home > Archive > PERL CGI Beginners > January 2006 > regex nth match of string, perl 5.8.5
| Author |
regex nth match of string, perl 5.8.5
|
|
| Chris Cosner 2006-01-23, 6:55 pm |
| Hello,
I have some data coming into a system with markup that I need to
reinterpret into HTML. I have a feeling I'm making this problem more
complex than it needs to be.
The beginning and ending markup are the same. So <I>text<I> would
become <I>text</I> in HTML. A single line can have more than one of
these italicized words, for example.
So this doesn't work:
s/(\<I\>)(.*)(\<I\>)/$1$2\<\/I\>/g
because .* matches anything in between and only the last one is
considered. <I>text<I>blahblah<I>text<I> becomes
<I>textblahblahtext</I>.
I would like it to match every second <I> and replace it with </I>, no
matter how many occurrences in a line. Am I on the right track with
something like this?:
for $line (@descriptions) {
....
$count = 0;
$line =~
s{
\<I\>
}{
if (++$count==2){
\<\/I\>
}
else {
\<I\>
}
}gex;
....
}
which gives me the following error:
Unterminated <> operator at program.pl line 48.
Line 48 is }{
So I have the syntax wrong, at the very least.
Should I be thinking more in terms of some sort of lookahead?
Any hints as to how to approach this will be appreciated.
-Chris Cosner
| |
| Paul Lalli 2006-01-23, 6:55 pm |
| Chris Cosner wrote:
> I have some data coming into a system with markup that I need to
> reinterpret into HTML. I have a feeling I'm making this problem more
> complex than it needs to be.
You are. :-)
> The beginning and ending markup are the same. So <I>text<I> would
> become <I>text</I> in HTML. A single line can have more than one of
> these italicized words, for example.
>
> So this doesn't work:
> s/(\<I\>)(.*)(\<I\>)/$1$2\<\/I\>/g
< and > are not special in a regular expression. No need to escape
them. Further, there is no reason to make your life difficult by
choosing a delimiter which actually appears in your replacement string,
causing you to escape that as well:
s#(<I>)(.*)(<I>)#$1$2</I>/g;
> because .* matches anything in between and only the last one is
> considered. <I>text<I>blahblah<I>text<I> becomes
> <I>textblahblahtext</I>.
Right. So change that behavior. Have you read
perldoc perlretut
yet? That is a good place to start. Search for "non-greedy".
Basically, you can change the behavior of the * quantifier from "as
much as possible" to "as little as possible" by appending a ?, like so:
s#(<I>)(.*?)(<I>)#$1$2</I>/g;
> I would like it to match every second <I> and replace it with </I>, no
> matter how many occurrences in a line. Am I on the right track with
> something like this?:
> for $line (@descriptions) {
> ...
> $count = 0;
> $line =~
> s{
> \<I\>
> }{
> if (++$count==2){
> \<\/I\>
> }
> else {
> \<I\>
> }
> }gex;
> ...
> }
> which gives me the following error:
> Unterminated <> operator at program.pl line 48.
> Line 48 is }{
> So I have the syntax wrong, at the very least.
Yes. You also have a logic error. The syntax error is the fact that
when you use the /e modifier, your replacement "string" becomes code,
and
\<\/I\>
is being interpreted as executable code. What you *want* is a string
containing those characters, rather than the code represented by those
characters. Replace that sequence with:
"</I>"
(including the quotes), do the same for \<I\> in the else branch, and
the syntax error will disappear.
The logic error is that you are *only* replacing the second <I>, not
the second, fourth, sixth, etc. Change
++$count==2
to
$count++ % 2
to test whether the current value of $count is odd (true on the second,
fourth, sixth, ... match when $count starts at 0), and then increment it.
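Putting the two fixes above together, the counter approach might look like the sketch below. It is only an illustration: the sample string in @descriptions is made up, and the if/else has been folded into a ternary for brevity.

```perl
use strict;
use warnings;

# Hypothetical sample data; the thread does not show the real input.
my @descriptions = (
    'A <I>first<I> and a <I>second<I> title',
);

for my $line (@descriptions) {
    my $count = 0;    # reset per line so pairing restarts on every line
    $line =~ s{
        <I>           # every marker, whether it opens or closes
    }{
        # with /e this part is code, so the replacements are quoted strings;
        # the test is true on the 2nd, 4th, 6th ... match
        $count++ % 2 ? '</I>' : '<I>'
    }gex;
    print "$line\n";  # prints: A <I>first</I> and a <I>second</I> title
}
```

The counter version does not care what sits between the markers, which is its one advantage over the non-greedy pattern; it is still more machinery than this problem needs.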
However, as explained above, none of this large block is needed. Just
take your existing regexp, and make the quantifier non-greedy.
> Should I be thinking more in terms of some sort of lookahead?
No, that would be even *more* complex than it needs to be... ;-)
Paul Lalli
| |
| Paul Lalli 2006-01-23, 6:55 pm |
|
Paul Lalli wrote:
> Chris Cosner wrote:
>
> You are. :-)
>
>
> < and > are not special in a regular expression. No need to escape
> them. Further, there is no reason to make your life difficult by
> choosing a delimiter which actually appears in your replacement string,
> causing you to escape that as well:
>
> s#(<I>)(.*)(<I>)#$1$2</I>/g;
Oy. Did I really type that?
s#(<I>)(.*)(<I>)#$1$2</I>#g;
is what I meant of course, and then later...
> Basically, you can change the behavior of the * quantifier from "as
> much as possible" to "as little as possible" by appending a ?, like so:
>
> s#(<I>)(.*?)(<I>)#$1$2</I>/g;
s#(<I>)(.*?)(<I>)#$1$2</I>#g;
Sorry about that.
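For anyone comparing the two corrected forms, here is a small side-by-side sketch. The input string is invented for the demonstration, and the patterns assume the markers are not followed by a space. The copy-then-substitute idiom is used because perl 5.8.5 has no /r modifier.

```perl
use strict;
use warnings;

# Hypothetical input with two italicized words on one line.
my $input = '<I>text<I>blahblah<I>more<I>';

# Greedy: .* runs to the last <I> on the line, so only that one is closed.
(my $greedy = $input) =~ s#(<I>)(.*)(<I>)#$1$2</I>#g;

# Non-greedy: .*? stops at the nearest <I>, so each pair is handled in turn.
(my $lazy = $input) =~ s#(<I>)(.*?)(<I>)#$1$2</I>#g;

print "greedy: $greedy\n";   # greedy: <I>text<I>blahblah<I>more</I>
print "lazy:   $lazy\n";     # lazy:   <I>text</I>blahblah<I>more</I>
```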
Paul Lalli
| |
| Dr.Ruud 2006-01-23, 6:55 pm |
| Chris Cosner schreef:
> The beginning and ending markup are the same. So <I>text<I> would
> become <I>text</I> in HTML. A single line can have more than one of
> these italicized words, for example.
If all in a single line:
s~(<I>.*?)<I>~$1</I>~g
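The "single line" condition presumably matters because . does not match a newline by default. If an italic span can cross a line break within one string, adding the /s modifier is one possible adjustment. The sketch below uses an invented string to show the difference; it is an assumption about the data, not something stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical description whose first italic span crosses a line break.
my $text = "<I>a long\ntitle<I> and <I>short<I>";

# Without /s, . stops at "\n", so the markers get paired up wrongly.
(my $plain = $text) =~ s~(<I>.*?)<I>~$1</I>~g;

# With /s, . also matches "\n", so each pair is closed as intended.
(my $dotall = $text) =~ s~(<I>.*?)<I>~$1</I>~gs;

print "without /s:\n$plain\n\n";
print "with /s:\n$dotall\n";
```

Without /s the second and third markers are treated as a pair, so the closing tag lands in the wrong place; with /s the result is <I>a long(newline)title</I> and <I>short</I>.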
--
Grtz, Ruud
| |
| Chris Cosner 2006-01-24, 3:55 am |
| >>
Bingo! Thanks for the replies.
>
> Just use the "non-greedy" form of "*":
>
> s{(\<I\>)(.*?)(\<I\>)}{$1$2\<\/I\>}g
>
> should do what you want.
>