PERL Miscellaneous archive, April 2005

matching multiple occurrences in the same line

michal.shmueli@gmail.com

2005-04-27, 3:58 pm

Hi,
I have a problem with pattern matching:
I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345

I wrote the following:

while(<FILE> ){
chomp($_);
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
}

but it just gives me the first occurrence of the pattern.
What's wrong with this?

Thanks a lot for your help

Michal

JayEs

2005-04-27, 3:58 pm

> string : <td class="year" rowspan="2">

Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?
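
A minimal sketch of that idea, assuming the HTML::TokeParser module from CPAN is installed. The sample markup and the sub name years_from_html are made up for illustration; the real page is not shown in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup standing in for the OP's page.
my $html = '<tr><td class="year" rowspan="2">345</td><td class="veh">car</td>'
         . '<td class="year" rowspan="2">2004</td></tr>';

# Collect the leading number from every <td class="year" rowspan="2"> cell.
sub years_from_html {
    my ($doc) = @_;
    # A reference to a scalar is parsed as the document itself.
    my $p = HTML::TokeParser->new(\$doc) or die "cannot parse document";
    my @years;
    while (my $tag = $p->get_tag('td')) {
        my $attr = $tag->[1];                      # hash ref of attributes
        next unless ($attr->{class}   || '') eq 'year';
        next unless ($attr->{rowspan} || '') eq '2';
        my $text = $p->get_trimmed_text;           # text up to the next tag
        push @years, $1 if $text =~ /^(\d+)/;
    }
    return @years;
}

print join("\t", years_from_html($html)), "\n";
```

The loop keeps calling get_tag until it returns nothing, so every matching cell is visited rather than only the first one.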

JS


Gunnar Hjalmarsson

2005-04-27, 8:57 pm

michal.shmueli@gmail.com wrote:
> I have one very long line, and I'm looking for all occurrences of this
> string: <td class="year" rowspan="2">
> in the line. After each occurrence of this string there is a
> number which I need to parse and print. For example, I need to extract
> 345 from this:
> <td class="year" rowspan="2">345
>
> I wrote the following:
>
> while(<FILE> ){
> chomp($_);


Why do you chomp()?

> if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}

------------^
What's that?

Use while instead of if, and add the /g modifier. Furthermore, the

.+<

part is not only redundant: since quantifiers are greedy by default, it
also consumes the rest of the line up to the last '<', which prevents you
from finding more than one occurrence.
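
A small sketch of the difference, using a made-up sample line modelled on the fragment quoted in this thread (the sub names are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical sample line with two "year" cells.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh">x</td>'
         . '<td class="year" rowspan="2">2005</td>';

# With /g in list context, every capture on the line is returned.
sub years_with_g {
    my ($text) = @_;
    return $text =~ m/<td class="year" rowspan="2">(\d+)/g;
}

# Same pattern, but with the greedy ".+<" tail kept: the first match
# runs on to the last "<" in the line, leaving nothing for a second match.
sub years_with_greedy_tail {
    my ($text) = @_;
    return $text =~ m/<td class="year" rowspan="2">(\d+).+</g;
}

print join(",", years_with_g($line)), "\n";            # 2004,2005
print join(",", years_with_greedy_tail($line)), "\n";  # 2004
```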

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Gunnar Hjalmarsson

2005-04-27, 8:57 pm

JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?


The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem
that 'requires' a parsing module. It can easily be handled using a
regex, even if the string in question starts with '<' and ends with '>'. ;-)

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

michal.shmueli@gmail.com

2005-04-27, 8:57 pm


JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?
>
> JS


i've tried the following code but it's not working...

use HTML::TokeParser;

$file="res.html"
$p = HTML::TokeParser->new($file);
if ($p->get_tag("td")) {
my $td = $p->get_trimmed_text;
print "Td: $td\n";
}

Am i missing something?

thanks again

michal.shmueli@gmail.com

2005-04-27, 8:57 pm

Yep, sorry. I've changed it a bit and it's working properly...

thanks

michal.shmueli@gmail.com

2005-04-27, 8:57 pm

Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table. So anyway I tried
the following, as you suggested:
while(<FILE> ){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

Now I get some compilation errors.
the original line (part) is : <td class="year" rowspan="2">2004</td><td
class="veh" rowspan="2"><a

many thanks

JayEs

2005-04-27, 8:57 pm


>
> The OP is looking for *all* occurrences of that fixed string. The fact
> that it's HTML does not make the OP's problem a HTML parsing problem


<SNIP>

Entirely correct! I simply offered another solution for the same problem.
Tim Toady? ;-)
The fact that the OP is looking for a value (ALL of them) that is prefixed
with the same HTML tag, makes TokeParser a good alternative IMHO. Later the
OP states that he can't use TokeParser because he needs to do more string
matching on non-HTML, but I didn't have that info at the time...

Anyway, both suggestions work on the original problem :-)

JS


Gunnar Hjalmarsson

2005-04-27, 8:57 pm

michal.shmueli@gmail.com wrote:
> Actually, i don't want to use the html parser- it's ok, but i need to
> parse more patterns which are not part of the table.


Not sure I follow you. The more complex the task is, the more likely a
parsing module is suitable.

> so anyway i tried the follow as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

----------^-^------------------------------------^

That's not what I suggested.
- The '~' character is still there. (I suppose you don't know what
it's supposed to do.)
- Modifiers shall be appended, not prepended, to the regex.
- The dot is still redundant.
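
Put together, the three points above might look like the following. This is only a sketch: the in-memory filehandle, the sample data and the sub name extract_years are assumptions made for illustration, not the OP's actual file or code.

```perl
use strict;
use warnings;

# Hypothetical input standing in for the OP's file: one long line.
my $data = qq{<td class="year" rowspan="2">2004</td><td class="veh" rowspan="2"><a}
         . qq{ href="x">car</a></td><td class="year" rowspan="2">1999</td>\n};

# Read every line from $fh and return all year values found.
sub extract_years {
    my ($fh) = @_;
    my @found;
    while (my $line = <$fh>) {
        # No "~", no trailing dot; /g is appended after the closing
        # delimiter, and =~ binds the match to $line.
        while ($line =~ m/<td class="year" rowspan="2">(\d+)/g) {
            push @found, $1;
        }
    }
    return @found;
}

open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
print "\t$_" for extract_years($fh);
print "\n";
close $fh;
```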

For a regex to be a suitable alternative to a module (in certain cases),
you need to know how regexes work. It's obvious that you need to read up
on it:

perldoc perlrequick
perldoc perlretut
perldoc perlre

Good luck!

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Hendrik Maryns

2005-04-27, 8:57 pm

michal.shmueli@gmail.com wrote the following on 27/04/2005 21:17:
> Actually, I don't want to use the HTML parser. It's OK, but I need to
> parse more patterns which are not part of the table. So anyway I tried
> the following, as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
>
> Now I get some compilation errors.
> the original line (part) is : <td class="year" rowspan="2">2004</td><td
> class="veh" rowspan="2"><a


The g should come at the end:

while(<FILE> ){
while(~ m/<td class="year" rowspan="2">(\d+)./g) {print OUT "\t$1";}

Furthermore, I don't see what this ~ is doing there, and you don't need
the final dot:

while(m/<td class="year" rowspan="2">(\d+)/g) {print OUT "\t$1"}

or, more Perlish:

print OUT "\t$1" while(m/<td class="year" rowspan="2">(\d+)/g);
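
Both forms rely on the same behaviour: in scalar context, a /g match resumes where the previous successful match on that string ended, and pos() reports that offset. A small sketch with a made-up sample line (the sub name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical sample line with two "year" cells.
my $line = '<td class="year" rowspan="2">2004</td>'
         . '<td class="year" rowspan="2">1999</td>';

# Record each captured year together with the offset where its match ended.
sub years_with_offsets {
    my ($text) = @_;
    my @pairs;
    push @pairs, [$1, pos($text)]
        while $text =~ m/<td class="year" rowspan="2">(\d+)/g;
    return @pairs;
}

printf "%s ends at offset %d\n", @$_ for years_with_offsets($line);
```

Each pass through the loop starts searching at the previous offset, which is why the same cell is never reported twice.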

HTH, H.

--
Hendrik Maryns

Interesting websites:
www.lieverleven.be (I cooperate)
www.eu04.com European Referendum Campaign
aouw.org The Art Of Urban Warfare

Tad McClellan

2005-04-27, 8:57 pm

michal.shmueli@gmail.com <michal.shmueli@gmail.com> wrote:

> So anyway I tried
> the following, as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

^
^

That is the bitwise negation operator.

Why are you using that there?

What is the point of the final dot in your pattern?


> Now I get some compilation errors.



Well yes, because that is not the change that was suggested.

It was suggested to add a "g" option to the pattern match operator.

See perlop.pod for how to add pattern match options.
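
To illustrate the difference: unary ~ is an arithmetic operator that flips bits, while =~ is the binding operator that applies a match to a variable. The sub names below are invented for this sketch.

```perl
use strict;
use warnings;

# Unary ~ flips every bit of its operand; it has nothing to do with matching.
# Masking with 0xFF keeps only the low byte so the result is easy to read.
sub low_byte_of_complement {
    my ($n) = @_;
    return ~$n & 0xFF;
}

# The binding operator =~ applies the pattern to the named variable.
sub has_year_cell {
    my ($text) = @_;
    return $text =~ m/<td class="year" rowspan="2">\d+/ ? 1 : 0;
}

printf "%d\n", low_byte_of_complement(5);                          # 250
printf "%d\n", has_year_cell('<td class="year" rowspan="2">345');  # 1
printf "%d\n", has_year_cell('no table cell here');                # 0
```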


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

michal.shmueli@gmail.com

2005-04-27, 8:57 pm

Thanks for all the help.
It seems to work fine. What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.

So the code I wrote:

while(<FILE> ){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}


but it seems to be an infinite loop

any ideas?

Michal


Hendrik Maryns wrote:
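> michal.shmueli@gmail.com wrote the following on 27/04/2005 21:17:
> > Actually, i don't want to use the html parser- it's ok, but i need to
> > parse more patterns which are not part of the table. so anyway i tried
> > the follow as you suggested:
> > while(<FILE> ){
> > while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
> >
> > now i get some compliation errors.
> > the original line (part) is : <td class="year" rowspan="2">2004</td><td
> > class="veh" rowspan="2"><a
>
> The g should come at the end:
>
> while(<FILE> ){
> while(~ m/<td class="year" rowspan="2">(\d+)./g) {print OUT "\t$1";}
>
> Furthermore, I don't see what this ~ is doing there, and you don't need
> the final dot:
>
> while(m/<td class="year" rowspan="2">(\d+)/g) {print OUT "\t$1"}
>
> or, more Perlish:
>
> print OUT "\t$1" while(m/<td class="year" rowspan="2">(\d+)/g);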
>
> HTH, H.
>
> --
> Hendrik Maryns
>
> Interesting websites:
> www.lieverleven.be (I cooperate)
> www.eu04.com European Referendum Campaign
> aouw.org The Art Of Urban Warfare


Gunnar Hjalmarsson

2005-04-28, 3:59 am

michal.shmueli@gmail.com wrote:
> What I need is to search for 3 different patterns
> which appear in this line many times; they always appear in this
> order. Moreover, the second one (listing_id) may appear twice.
>
> So the code I wrote:
>
> while(<FILE> ){
> while((m/<td class="year" rowspan="2">(\d+)/g) ||
> ((m/listing_id=(\d+)/g) ||
> ((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}
>
>
> but it seems to be an infinite loop


No it's not, since it doesn't even compile. Please copy and paste code
that you post; don't re-type it!
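
For what it's worth, once the stray parentheses and the trailing || are removed, a condition that chains several /g matches with || can indeed loop forever. A failed m//g resets the search position of $_, so the next alternative starts again from the beginning of the line and finds its first match once more. A small sketch with simplified, made-up patterns and an iteration cap (both are assumptions for illustration):

```perl
use strict;
use warnings;

# Count loop iterations for a "/g || /g" condition, giving up after $cap rounds.
sub iterations_until_cap {
    my ($text, $cap) = @_;
    my $count = 0;
    local $_ = $text;
    while (m/year=(\d+)/g || m/id=(\d+)/g) {
        last if ++$count >= $cap;    # safety net so the demo terminates
    }
    return $count;
}

# Both kinds of field present: the loop only stops because of the cap.
print iterations_until_cap('year=2004 id=7 year=1999 id=8', 50), "\n";

# Only one kind of field present: the loop ends on its own.
print iterations_until_cap('year=2004 year=1999', 50), "\n";
```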

> any ideas?


Your approach seems odd to me, and I prefer not to comment on it. This
is an alternative approach:

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';

my $pattern = qr($s1|$s2|$s3);

my $data = do { local $/; <FILE> };

print "\t$+" while $data =~ /$pattern/g;

It has a few advantages compared to what you were trying to do, but
there are most certainly details, that only you know about, requiring
further tweaking. For instance, the pattern for price:

\d+(?:\.\d+)?

may or may not be correct in this case. Maybe you'd use Regexp::Common's
number patterns instead. In any case, I doubt that just \S+
will give you what you want.
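
A runnable variant of the suggestion above, fed with made-up sample data instead of the OP's file (the data, the sub name extract_fields and the $loose variable are assumptions for illustration):

```perl
use strict;
use warnings;

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';
my $pattern = qr/$s1|$s2|$s3/;

# Hypothetical line: one year, listing_id twice, one price.
my $data = '<td class="year" rowspan="2">2004</td><a href="v?listing_id=17">x</a>'
         . '<a href="w?listing_id=17">y</a><td class="price">9500.50</td>';

# Return every captured value in the order it appears in $text.
sub extract_fields {
    my ($text) = @_;
    my @fields;
    # $+ holds the capture group that took part in the latest match,
    # whichever of the three alternatives it belonged to.
    push @fields, $+ while $text =~ /$pattern/g;
    return @fields;
}

print "\t$_" for extract_fields($data);
print "\n";

# For comparison: \S+ keeps going until whitespace, so it also
# swallows the closing tag that follows the price.
my ($loose) = $data =~ /<td\s+class="price">(\S+)/;
print "$loose\n";
```

Because all three alternatives live in one pattern, a single /g loop walks the line once and keeps the values in document order.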

Please use the docs if there are details in the above suggestion that
you don't understand.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl