Regular Expression - Matching over multiple lines
| Roman Hanousek 2004-05-18, 1:30 am |
| Hi All
I have a bunch of files that contain code like this:
<ps:img page="/images/portal/arrow_down.gif" border="0"
width="9" height="6"
alt="${string['lists.list.sort.ascending.alt']}"
title="${string['lists.list.sort.ascending.alt']}" />
What I am trying to do is match everything from <ps:img to the closing />
and then check that this piece of code contains an alt= attribute.
If it doesn't, I want to print the lines where it's missing to the screen
or to a file.
Cheers, any help appreciated.
| |
| Andrew Gaffney 2004-05-18, 1:30 am |
| Roman Hanousek wrote:
> Hi All
>
> I have a bunch of files that contain code like this:
>
> What I am trying to do is match <ps:img and this /> then check that this
> piece of code contains an alt= attribute.
>
>
> <ps:img page="/images/portal/arrow_down.gif" border="0"
> width="9" height="6"
> alt="${string['lists.list.sort.ascending.alt']}"
> title="${string['lists.list.sort.ascending.alt']}" />
>
>
> And if it doesn't, print the lines where it's missing to screen or file.
while ($input =~ m|<ps:img .+(alt\s*=\s*\".+\")?.+/>|sgc) {
    print "Missing ALT\n" if !defined $1;
}
That doesn't give you line numbers, but it does give you an idea of where to start.
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
| |
| James Edward Gray II 2004-05-18, 10:33 am |
| On May 17, 2004, at 11:16 PM, Andrew Gaffney wrote:
> Roman Hanousek wrote:
>
> while ($input =~ m|<ps:img .+(alt\s*=\s*\".+\")?.+/>|sgc) {
>     print "Missing ALT\n" if !defined $1;
> }
>
> That doesn't give you line numbers, but it does give you an idea of
> where to start.
Be careful. Matching HTML-style markup with regexen is surprisingly
tricky. I suspect the version above would not work well in many
instances. Remember .+ is super greedy, more so since the /s modifier
lets it swallow \n as well. The above pattern should match the first
<ps:img, swallow ALL the rest of the data, and then back up until it can
find a />. That's probably not going to work out too well in many cases.
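To make the greediness concrete, here is a minimal sketch. The two tags in $input are made-up sample data, not taken from Roman's files, and the patterns are simplified versions of the ones discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical sample: two separate tags with other markup in between.
my $input = <<'END';
<ps:img page="/images/a.gif"
    border="0" />
<p>some text</p>
<ps:img page="/images/b.gif"
    alt="b" />
END

# Greedy .+ with /s: a single match running from the first <ps:img
# all the way to the LAST /> in the data.
my @greedy = $input =~ m|(<ps:img .+/>)|sg;

# Negated character class: each match has to stop at the first > it meets.
my @per_tag = $input =~ m|(<ps:img [^>]+/>)|g;

printf "greedy: %d match(es), per-tag: %d match(es)\n",
    scalar @greedy, scalar @per_tag;
# prints "greedy: 1 match(es), per-tag: 2 match(es)"
```

The greedy version lumps both tags into one match, so a missing alt in the first tag would be hidden by the alt in the second.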
Depending on how much is known about the tags, you might have more luck
with a pattern like:
m!<ps:img([^>]+)/>!g
From there it's pretty easy to check $1 for an alt="...", or whatever.
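As a rough sketch of that check (not tested against the real files): the sub name missing_alt_lines and the sample data are made up for illustration, and it assumes the whole file has been read into one string and that no attribute value contains a > character.

```perl
use strict;
use warnings;

# Return the starting line number of every <ps:img ... /> tag in $text
# that has no alt attribute. (Hypothetical helper, for illustration only.)
sub missing_alt_lines {
    my ($text) = @_;
    my @lines;
    while ($text =~ m!<ps:img([^>]+)/>!g) {
        my $attrs = $1;
        my $start = $-[0];    # offset where the whole match began
        next if $attrs =~ /\balt\s*=/;
        # Count the newlines before the tag to get a 1-based line number.
        my $before = substr($text, 0, $start);
        push @lines, 1 + ($before =~ tr/\n//);
    }
    return @lines;
}

# Made-up sample: the first tag has an alt attribute, the second does not.
my $sample = <<'END';
<ps:img page="/images/portal/arrow_down.gif" border="0"
    width="9" height="6"
    alt="down" />
<p>text</p>
<ps:img page="/images/portal/arrow_up.gif" border="0"
    width="9" height="6" />
END

print "Missing alt in tag starting on line $_\n"
    for missing_alt_lines($sample);
# prints "Missing alt in tag starting on line 5"
```

The alt test is deliberately simple; an attribute value that happens to contain the text alt= would fool it.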
Hope that helps.
James
| |
| Andrew Gaffney 2004-05-18, 11:30 am |
| James Edward Gray II wrote:
> On May 17, 2004, at 11:16 PM, Andrew Gaffney wrote:
>
>
>
> Be careful. Matching HTML-style markup with regexen is surprisingly
> tricky. I suspect the version above would not work well in many
> instances. Remember .+ is super greedy, more so since you allow it to
> swallow \n as well. The above pattern should match the first <ps:img,
> swallow the rest of ALL the data and then back up until it can find a
> />. That's probably not going to work out too well in many cases.
>
> Depending on how much is known about the tags, you might have more luck
> with a pattern like:
>
> m!<ps:img([^>]+)/>!g
>
> From there it's pretty easy to check $1 for an alt="...", or whatever.
>
> Hope that helps.
Doesn't the 'gc' modifier make the whole thing not as greedy? As a side effect
of continuation, doesn't it try to match as many times as possible?
| |
| James Edward Gray II 2004-05-18, 12:31 pm |
| On May 18, 2004, at 9:30 AM, Andrew Gaffney wrote:
> Doesn't the 'gc' modifier make the whole thing not as greedy? As a
> side effect of continuation, doesn't it try to match as many times as
> possible?
I'm not familiar with this, but my gut reaction is no. Perhaps one of
the Regex experts can clear that up for us...
James
| |
| Jeff 'Japhy' Pinyan 2004-05-18, 1:30 pm |
| On May 18, James Edward Gray II said:
>On May 18, 2004, at 9:30 AM, Andrew Gaffney wrote:
>
>
>I'm not familiar with this, but my gut reaction is no. Perhaps one of
>the Regex experts can clear that up for us...
Correct. No modifier to a regex changes the greediness of the quantifiers
in the regex. All the /g modifier does is say:
  1. if the regex is in list context, match, and then try to match again
     following the first match, etc., until a match fails
  2. if the regex is in scalar context, match and return, but remember
     where we left off -- the next time a regex with the /g modifier is
     matched against the same string, we will pick up where we stopped.
     This position can also be used with the \G anchor.
Here are examples:
  my $str = "japhy knows regexes";

  my @all_letters = $str =~ /\w/g;
  # @all_letters contains 17 elements: j,a,p,h,y,k,n,o,etc.
  # and before you ask, NO, I DON'T need parens around \w in there

  while ($str =~ /(\w+)/g) {
      print "Got: '$1'\n";  # Got: 'japhy'; Got: 'knows'; Got: 'regexes'
  }

  if ($str =~ /(\w\w)/g) {
      print "Two letters: '$1'\n";  # 'ja'
      if ($str =~ /\G(.{5})/) {
          print "Next five characters: '$1'\n";  # 'phy k'
      }
  }
Once a /g match fails, \G is cleared (\G is linked to the pos() function;
that is, whatever pos($str) is equal to is the location in $str that \G
anchors to).
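A small sketch of that link between /g and pos(), using the same sample string; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $str = "japhy knows regexes";

# Scalar-context /g match: matches "japhy" and remembers where it stopped.
my $matched     = $str =~ /\w+/g;
my $after_first = pos($str);    # 5, just past "japhy"

# There are no digits in $str, so this /g match fails ...
my $failed     = $str =~ /\d/g;
my $after_fail = pos($str);     # ... and the failure resets pos() to undef

print "after first match: $after_first\n";
print "after failed match: ",
    (defined $after_fail ? $after_fail : 'undef'), "\n";
```

Because the position is reset, the next /g match against $str starts from the beginning of the string again.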
*ALL* that the /c modifier does (and it only matters when used with the /g
modifier) is tell the regex engine NOT to clear \G or pos() when a match
fails. Here's a method called the inchworm:
  print "Got '$1'\n" while
      $str =~ /\G"([^"]*)"\s*/gc or
      $str =~ /\G'([^']*)'\s*/gc or
      $str =~ /\G(\S+)\s*/gc;
This allows us to use $1 no matter which regex matches, and because all
three regexes have the /gc modifier, when the first one fails, it'll try
the second one, AT THE SAME LOCATION.
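Here is a runnable sketch of the inchworm wrapped in a sub. The name tokens and the sample input are invented for the example, and the third pattern here also consumes trailing whitespace, which lets the loop continue past a bare word.

```perl
use strict;
use warnings;

# Split a string into tokens: "double quoted", 'single quoted', or bare words.
# (Hypothetical helper, for illustration only.)
sub tokens {
    my ($str) = @_;
    my @tokens;
    # Each pattern is anchored with \G at the current position. Thanks to /c,
    # a failed attempt leaves pos($str) alone, so the next alternative is
    # tried at exactly the same spot.
    push @tokens, $1 while
        $str =~ /\G"([^"]*)"\s*/gc or
        $str =~ /\G'([^']*)'\s*/gc or
        $str =~ /\G(\S+)\s*/gc;
    return @tokens;
}

my @t = tokens(q{"hello world" 'single quoted' bare words});
print "Got '$_'\n" for @t;
# Got 'hello world'
# Got 'single quoted'
# Got 'bare'
# Got 'words'
```

The loop ends once all three patterns fail at the same position, which for this input is the end of the string.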
--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
CPAN ID: PINYAN [Need a programmer? If you like my work, let me know.]
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.