| Author |
regular expression pb. with tags
|
|
| steeve_dun@SoftHome.net 2006-09-26, 8:03 am |
| Hi,
I want to do some pattern replacement, i.e. delete everything
that's between two tags.
For example for
1<tag> 2</tag>3
x<tag> a<tag> b </tag> c</tag>z
I want to get
1 3
x z
But I have a problem with embedded (nested) tags.
I've tried:
$text =~ s/\<tag\>(.*?)\<\/tag\>//sg;
but it doesn't work for embedded tags. It gives:
13
x c</tag>z
Is there a way to deal with this?
Thank you
-steeve
| |
| David Squire 2006-09-26, 8:03 am |
| steeve_dun@SoftHome.net wrote:
> Is there a way to deal with this?
Yep. Don't try to use regular expressions to parse XML. Use a module
that understands XML. Go to CPAN and you will find many.
DS
| |
| anno4000@radom.zrz.tu-berlin.de 2006-09-26, 8:03 am |
| <steeve_dun@SoftHome.net> wrote in comp.lang.perl.misc:
> Is there a way to deal with this?
Not using regular expressions directly. Use one of the HTML-parsing
modules from CPAN.
Anno
| |
| Xicheng Jia 2006-09-26, 6:59 pm |
| steeve_dun@SoftHome.net wrote:
> Is there a way to deal with this?
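Since you are using Perl, and XML is quite well formatted, you may try
something like:
my $ptn;
# (??{$ptn}) re-enters the pattern at match time, so a nested <tag>...</tag>
# is consumed whole before the outer </tag> is looked for.
$ptn = qr(<tag>(?:(??{$ptn})|.)*?</tag>)s;
$line =~ s/$ptn//g;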
I am not encouraging you to use regexes for this at work. But for
small programs, a regex can be much faster and easier to write, if you
know what you are doing.
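On Perl 5.10 or later, the same recursive idea can be sketched with a named group instead of the deferred (??{...}) construct. This is only an illustration under that version assumption: the group name "tagged" and the variable $nested are made up for the example, and it assumes the input is well balanced.

```perl
use strict;
use warnings;
use 5.010;   # named recursion, (?&name), needs Perl 5.10 or later

# "tagged" is a hypothetical group name chosen for this sketch.
# (?&tagged) re-enters the group, so an inner <tag>...</tag> is consumed
# as one unit before the outer </tag> is looked for.
my $nested = qr{(?<tagged><tag>(?:(?&tagged)|.)*?</tag>)}s;

my $text = "1<tag> 2</tag>3\nx<tag> a<tag> b </tag> c</tag>z\n";
$text =~ s/$nested//g;
print $text;   # prints "13" and "xz", each on its own line
```

Note that the result is "13" and "xz", not "1 3" and "x z": the spaces in the sample input sit inside the tags and are deleted with them.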
Regards,
Xicheng
| |
| Ted Zlatanov 2006-09-26, 6:59 pm |
| On 26 Sep 2006, steeve_dun@softhome.net wrote:
> Is there a way to deal with this?
For the first example, you're getting exactly what you asked for ("13").
Look at your input data: the only space is inside the tags, so it is
deleted along with them.
For the second example, your requirements are underspecified. You don't say
whether you want to remove everything between the outermost tags (in which
case a plain regex would work) or you want to balance tags. For outermost tag
replacement, use
$text =~ s/\<tag\>(.*)\<\/tag\>//sg;
but note that this will also replace "<tag>a</tag> extra <tag>b</tag>"
with "" and not " extra " as you may expect.
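The difference between the greedy and the non-greedy quantifier can be checked with a small script. This is a minimal sketch; the variable names are chosen only for the illustration.

```perl
use strict;
use warnings;

my $input = "<tag>a</tag> extra <tag>b</tag>";

# Greedy: .* runs from the first <tag> to the LAST </tag> in the string.
(my $greedy = $input) =~ s/<tag>.*<\/tag>//sg;

# Non-greedy: .*? stops at the NEAREST </tag> after each <tag>.
(my $lazy = $input) =~ s/<tag>.*?<\/tag>//sg;

print "greedy: [$greedy]\n";   # greedy: []
print "lazy:   [$lazy]\n";     # lazy:   [ extra ]
```

Neither form counts nesting depth: the greedy one swallows text between separate tag pairs, and the non-greedy one pairs an outer opening tag with the first inner closing tag.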
My guess is that you do want to balance tags, and you can use
Text::Balanced for that (especially if your text is not valid XML or
even SGML). If you are doing SGML/HTML/XML/etc. tagged formats then
you should search CPAN for the appropriate parser, as others have
suggested. Look at "perldoc -q html" as well.
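What "balancing tags" means can be sketched by hand with a depth counter. This is not the Text::Balanced API, only a small illustration of the idea; strip_balanced is a hypothetical helper name, and it assumes the only markup is the literal <tag> and </tag>.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every <tag>...</tag> region, nested or not.
sub strip_balanced {
    my ($text) = @_;
    my $depth = 0;
    my $out   = '';
    # The capturing group makes split return the tags as tokens too.
    for my $token (split /(<\/?tag>)/, $text) {
        if ($token eq '<tag>') {
            $depth++;
        }
        elsif ($token eq '</tag>') {
            $depth-- if $depth > 0;   # a stray closing tag is dropped
        }
        elsif ($depth == 0) {
            $out .= $token;           # keep only text outside all tags
        }
    }
    return $out;
}

print strip_balanced("x<tag> a<tag> b </tag> c</tag>z"), "\n";   # xz
print strip_balanced("<tag>a</tag> extra <tag>b</tag>"), "\n";   #  extra
```

Unlike the greedy regex, this keeps " extra " between two separate tag pairs while still removing a nested pair as a whole. For real SGML/HTML/XML, a parser module remains the safer choice.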
Ted
| |
| steeve_dun@SoftHome.net 2006-09-27, 4:00 am |
| Thank you all
-steve
|
|
|
|