Code Comments
Hi,
I have a problem with pattern matching:
I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345
I wrote the following:
while(<FILE> ){
chomp($_);
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
}
But it just gives me the first occurrence of the pattern.
What's wrong with this?
thanks a lot for your help
Michal
> string : <td class="year" rowspan="2">
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?
JS
michal.shmueli@gmail.com wrote:
> I have one very long line, and I'm looking for all occurrences of this
> string: <td class="year" rowspan="2">
> in the line. After each occurrence of this string there is a
> number which I need to parse and print. For example, I need to extract
> 345 from this:
> <td class="year" rowspan="2">345
>
> I wrote the following:
>
> while(<FILE> ){
> chomp($_);
Why do you chomp()?
> if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
------^
What's that?
Use while instead of if, and add the /g modifier. Furthermore, the
.+<
part is not only redundant; since quantifiers are greedy by default,
it also prevents you from finding more than one occurrence.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
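To illustrate the point about greediness and /g, here is a minimal sketch. The sample line is made up (modelled on the fragment quoted later in the thread), and $line, @greedy and @all are names chosen only for this example. With the trailing .+< in place, a single match swallows the rest of the line, so even a /g loop would find nothing more after it.

```perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the fragment quoted in the thread.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh" rowspan="2"><a'
         . '<td class="year" rowspan="2">2005</td><td class="veh" rowspan="2"><a';

# Original pattern: the greedy .+ runs to the end of the line and backs up
# only as far as the last '<', so the one match covers almost everything.
my @greedy;
if ($line =~ m/<td class="year" rowspan="2">(\d+).+</) {
    push @greedy, $1;
}

# Suggested fix: drop the trailing .+< and loop with the /g modifier, which
# resumes each search where the previous match ended.
my @all;
while ($line =~ m/<td class="year" rowspan="2">(\d+)/g) {
    push @all, $1;
}

print "greedy: @greedy\n";   # only the first year
print "global: @all\n";      # every year on the line
```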
JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?
The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem a HTML parsing problem
that 'requires' a parsing module. It can easily be handled using a
regex, even if the string in question starts with '<' and ends with '>'. ;-)
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?
>
> JS
I've tried the following code but it's not working...
use HTML::TokeParser;
$file="res.html"
$p = HTML::TokeParser->new($file);
if ($p->get_tag("td")) {
my $td = $p->get_trimmed_text;
print "Td: $td\n";
}
Am I missing something?
thanks again
Yep, sorry. I've changed it a bit and it's working properly... thanks
Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table. So anyway, I tried
the following as you suggested:
while(<FILE> ){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
Now I get some compilation errors.
Part of the original line is: <td class="year" rowspan="2">2004</td><td
class="veh" rowspan="2"><a
many thanks
> The OP is looking for *all* occurrences of that fixed string. The fact
> that it's HTML does not make the OP's problem a HTML parsing problem
<SNIP>
Entirely correct! I simply offered another solution for the same
problem. Tim Toady? ;-)
The fact that the OP is looking for a value (ALL of them) that is
prefixed with the same HTML tag, makes TokeParser a good alternative IMHO.
Later the OP states that he can't use TokeParser because he needs to do
more string matching on non-HTML, but I didn't have that info at the time...
Anyway, both suggestions work on the original problem :-)
JS
michal.shmueli@gmail.com wrote:
> Actually, I don't want to use the HTML parser. It's OK, but I need to
> parse more patterns which are not part of the table.
Not sure I follow you. The more complex the task is, the more likely a
parsing module is suitable.
> So anyway, I tried the following as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
--------^-^------------------------------------^
That's not what I suggested.
- The '~' character is still there. (I suppose you don't know what
it's supposed to do.)
- Modifiers shall be appended, not prepended, to the regex.
- The dot is still redundant.
For a regex to be a suitable alternative to a module (in certain cases),
you need to know how regexes work. It's obvious that you need to read up
on it:
perldoc perlrequick
perldoc perlretut
perldoc perlre
Good luck!
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
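On the stray '~' pointed out above: in Perl a unary ~ is bitwise negation, while the binding operator is =~ (perhaps the '~' was a leftover from that). The sketch below, using made-up input strings and an illustrative @report array, suggests why the negated form cannot work as a test: the match returns 1 or a false value that counts as 0 numerically, and the bitwise complement of either is a non-zero number, so the condition should come out true whether or not the pattern matched.

```perl
use strict;
use warnings;

my @report;

# Two hypothetical inputs: one that contains the cell, one that does not.
for ('<td class="year" rowspan="2">2004</td>', 'nothing to match here') {
    # Plain match against $_ : true only when the pattern is found.
    my $plain   = m/<td class="year" rowspan="2">(\d+)/     ? 'true' : 'false';

    # Match result run through unary ~ : the complement of 1 or 0 is non-zero.
    my $negated = (~ m/<td class="year" rowspan="2">(\d+)/) ? 'true' : 'false';

    push @report, "plain=$plain negated=$negated";
}

print "$_\n" for @report;
```

With a condition that is always true, a while loop built on it would never terminate, which is one more reason to remove the '~' rather than just move the g.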
michal.shmueli@gmail.com wrote the following on 27/04/2005 21:17:
> Actually, I don't want to use the HTML parser. It's OK, but I need to
> parse more patterns which are not part of the table. So anyway, I tried
> the following as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
>
> Now I get some compilation errors.
> Part of the original line is: <td class="year" rowspan="2">2004</td><td
> class="veh" rowspan="2"><a
The g should come at the end:
while(<FILE> ){
while(~ m/<td class="year" rowspan="2">(\d+)./g) {print OUT "\t$1";}
Furthermore, I don't see what this ~ is doing there, and you don't need
the final dot:
while(m/<td class="year" rowspan="2">(\d+)/g) {print OUT "\t$1"}
or, more Perlish:
print OUT "\t$1" while(m/<td class="year" rowspan="2">(\d+)/g);
HTH, H.
--
Hendrik Maryns
Interesting websites:
www.lieverleven.be (I cooperate)
www.eu04.com European Referendum Campaign
aouw.org The Art Of Urban Warfare
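As a possible variation on the loops above (not something proposed in the thread itself): in list context a /g match returns all the captures at once, so the numbers can be collected without an explicit loop. The sample line and the names $line, @years and $out are assumptions made for this sketch; the real input would come from <FILE>.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real one would be read from <FILE>.
my $line = '<td class="year" rowspan="2">2004</td><td class="year" rowspan="2">2005</td>';

# In list context, a /g match with one capture group returns every capture.
my @years = $line =~ m/<td class="year" rowspan="2">(\d+)/g;

# Same output format as the loop versions: a tab before each number.
my $out = join '', map { "\t$_" } @years;
print $out, "\n";
```

This keeps all the years in an array, which may be handy if they are needed again later for the other patterns mentioned in the thread.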