matching multiple occurrences in the same line
| michal.shmueli@gmail.com 2005-04-27, 3:58 pm |
| Hi,
I have a problem with pattern matching:
I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345
I wrote the following:
while(<FILE> ){
chomp($_);
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
}
But it just gives me the first occurrence of the pattern.
What's wrong with this?
Thanks a lot for your help
Michal
| |
|
| > string : <td class="year" rowspan="2">
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?
JS
| |
| Gunnar Hjalmarsson 2005-04-27, 8:57 pm |
| michal.shmueli@gmail.com wrote:
> I have one very long line, and I'm looking for all occurrences of this
> string: <td class="year" rowspan="2">
> in the line. After each occurrence of this string there is a
> number which I need to parse and print. For example, I need to extract
> 345 from this:
> <td class="year" rowspan="2">345
>
> I wrote the following:
>
> while(<FILE> ){
> chomp($_);
Why do you chomp()?
> if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
------^
What's that?
Use while instead of if, and add the /g modifier. Furthermore, the
.+<
part is not only redundant: since quantifiers are greedy by default,
it swallows the rest of the line and so prevents you from finding more
than one occurrence.
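To see why the greedy tail matters, here is a minimal sketch. The sample line is made up for illustration (the thread only shows a fragment of the real input), and results are collected in arrays instead of being printed to OUT.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real input is not shown in full in the thread.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh">x</td>'
         . '<td class="year" rowspan="2">2005</td>';

# With the greedy tail: .+ runs to the end of the line and backtracks only
# as far as the last '<', so the first match consumes the later cells too.
my @greedy;
push @greedy, $1 while $line =~ m/<td class="year" rowspan="2">(\d+).+</g;

# Without the tail: each match ends right after the digits, so the next
# /g iteration can find the following cell.
my @years;
push @years, $1 while $line =~ m/<td class="year" rowspan="2">(\d+)/g;

print "with .+<   : @greedy\n";
print "without it : @years\n";
```

On this sample the first loop collects only 2004, while the second collects both 2004 and 2005.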
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Gunnar Hjalmarsson 2005-04-27, 8:57 pm |
| JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?
The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem
that 'requires' a parsing module. It can easily be handled using a
regex, even if the string in question starts with '<' and ends with '>'. ;-)
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| michal.shmueli@gmail.com 2005-04-27, 8:57 pm |
|
JayEs wrote:
>
> Since that looks a lot like HTML, why not use HTML::TokeParser and save
> yourself from the regex hassles?
>
> JS
I've tried the following code, but it's not working...
use HTML::TokeParser;
$file="res.html"
$p = HTML::TokeParser->new($file);
if ($p->get_tag("td")) {
my $td = $p->get_trimmed_text;
print "Td: $td\n";
}
Am I missing something?
Thanks again
| |
| michal.shmueli@gmail.com 2005-04-27, 8:57 pm |
| Yep, sorry. I changed it a bit and it's working properly now.
Thanks
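The post does not say what was changed. As a guess only, a working variant could look like the sketch below: it loops with while instead of testing once with if, and it filters on the class attribute. The sample HTML string is invented, and the document is passed as a scalar reference rather than read from res.html so that the sketch is self-contained. HTML::TokeParser comes from the HTML-Parser distribution on CPAN and is not part of core Perl.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module (HTML-Parser distribution)

# Hypothetical sample document standing in for res.html.
my $html = '<table><tr><td class="year" rowspan="2">2004</td>'
         . '<td class="veh">car</td>'
         . '<td class="year" rowspan="2">2005</td></tr></table>';

# A reference to a scalar is treated as the literal document to parse.
my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";

my @years;
# get_tag returns an array ref [tag, attr-hash, attr-order, text] for start
# tags, or undef when no further <td> is found.
while (my $tag = $p->get_tag('td')) {
    my $attr = $tag->[1];
    next unless defined $attr->{class} && $attr->{class} eq 'year';
    push @years, $p->get_trimmed_text('/td');   # text up to the closing </td>
}

print "Td: $_\n" for @years;
```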
| |
| michal.shmueli@gmail.com 2005-04-27, 8:57 pm |
| Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table. So anyway, I tried
the following, as you suggested:
while(<FILE> ){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
Now I get some compilation errors.
Part of the original line is: <td class="year" rowspan="2">2004</td><td
class="veh" rowspan="2"><a
Many thanks
| |
|
|
>
> The OP is looking for *all* occurrences of that fixed string. The fact
> that it's HTML does not make the OP's problem a HTML parsing problem
<SNIP>
Entirely correct! I simply offered another solution for the same problem.
Tim Toady? ;-)
The fact that the OP is looking for a value (ALL of them) that is prefixed
with the same HTML tag, makes TokeParser a good alternative IMHO. Later the
OP states that he can't use TokeParser because he needs to do more string
matching on non-HTML, but I didn't have that info at the time...
Anyway, both suggestions work on the original problem :-)
JS
| |
| Gunnar Hjalmarsson 2005-04-27, 8:57 pm |
| michal.shmueli@gmail.com wrote:
> Actually, I don't want to use the HTML parser. It's OK, but I need to
> parse more patterns which are not part of the table.
Not sure I follow you. The more complex the task is, the more likely a
parsing module is suitable.
> So anyway, I tried the following, as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
--------^-^------------------------------------^
That's not what I suggested.
- The '~' character is still there. (I suppose you don't know what
it's supposed to do.)
- Modifiers must be appended, not prepended, to the regex.
- The dot is still redundant.
For a regex to be a suitable alternative to a module (in certain cases),
you need to know how regexes work. It's obvious that you need to read up
on it:
perldoc perlrequick
perldoc perlretut
perldoc perlre
Good luck!
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Hendrik Maryns 2005-04-27, 8:57 pm |
| michal.shmueli@gmail.com wrote the following on 27/04/2005 21:17:
> Actually, I don't want to use the HTML parser. It's OK, but I need to
> parse more patterns which are not part of the table. So anyway, I tried
> the following, as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
>
> Now I get some compilation errors.
> the original line (part) is : <td class="year" rowspan="2">2004</td><td
> class="veh" rowspan="2"><a
The g should come at the end:
while(<FILE> ){
while(~ m/<td class="year" rowspan="2">(\d+)./g) {print OUT "\t$1";}
Furthermore, I don't see what this ~ is doing there, and you don't need
the final dot:
while(m/<td class="year" rowspan="2">(\d+)/g) {print OUT "\t$1"}
or, more Perlish:
print OUT "\t$1" while(m/<td class="year" rowspan="2">(\d+)/g);
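A related idiom, not mentioned in the thread, is to evaluate the match in list context: with /g and a single capture group, the match returns all captured values at once. A minimal sketch, using an invented sample line and printing to STDOUT rather than OUT:

```perl
use strict;
use warnings;

# Hypothetical sample line; only a fragment of the real input appears in the thread.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh" rowspan="2"><a>x</a></td>'
         . '<td class="year" rowspan="2">2005</td>';

# In list context, m//g returns every value captured by (\d+) in one go.
my @years = $line =~ m/<td class="year" rowspan="2">(\d+)/g;

print "\t$_" for @years;
print "\n";
```

This is convenient when the values are needed later as a list; the while form is a better fit when each match should be handled as soon as it is found.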
HTH, H.
--
Hendrik Maryns
Interesting websites:
www.lieverleven.be (I cooperate)
www.eu04.com European Referendum Campaign
aouw.org The Art Of Urban Warfare
| |
| Tad McClellan 2005-04-27, 8:57 pm |
| michal.shmueli@gmail.com <michal.shmueli@gmail.com> wrote:
> So anyway, I tried
> the following, as you suggested:
> while(<FILE> ){
> while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
        ^
        ^
That is the bitwise negation operator.
Why are you using that there?
What is the point of the final dot in your pattern?
> Now I get some compilation errors.
Well yes, because that is not the change that was suggested.
It was suggested to add a "g" option to the pattern match operator.
See perlop.pod for how to add pattern match options.
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| michal.shmueli@gmail.com 2005-04-27, 8:57 pm |
| Thanks for all the help.
It seems to work fine. What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.
So the code I wrote:
while(<FILE> ){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}
But it seems to be an infinite loop.
Any ideas?
Michal
| |
| Gunnar Hjalmarsson 2005-04-28, 3:59 am |
| michal.shmueli@gmail.com wrote:
> What I need is to search for 3 different patterns
> which appear in this line many times; they always appear in this
> order. Moreover, the second one (listing_id) may appear twice.
>
> So the code I wrote:
>
> while(<FILE> ){
> while((m/<td class="year" rowspan="2">(\d+)/g) ||
> ((m/listing_id=(\d+)/g) ||
> ((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}
>
>
> But it seems to be an infinite loop.
No it's not, since it doesn't even compile. Please copy and paste the code
that you post; don't re-type it!
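For what it's worth, once the stray || and the unbalanced parentheses are repaired, a loop of that shape can indeed run forever. All the m//g matches share one search position on the same string, and a failed /g match resets that position to the start, so an earlier alternative that has run out of matches keeps sending a later one back to the beginning. The sketch below uses invented stand-in data and a guard counter, which exists only so that the demonstration terminates.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: A1/A2 play the role of the "year" cells,
# B1/B2 the role of the listing ids.
my $line = 'A1 B1 A2 B2';

my @seen;
my $guard = 0;   # safety cap; without it this loop would never end
while ($line =~ m/(A\d)/g || $line =~ m/(B\d)/g) {
    push @seen, $1;
    last if ++$guard >= 8;
}

# The A pattern fails after A2, which resets the search position, so the
# B pattern finds B1 again from the start, after which A2 matches again.
print "@seen\n";
```

On this data the loop cycles between B1 and A2 and never reaches B2.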
> any ideas?
Your approach seems odd to me, and I prefer not to comment on it. This
is an alternative approach:
my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';
my $pattern = qr($s1|$s2|$s3);
my $data = do { local $/; <FILE> };
print "\t$+" while $data =~ /$pattern/g;
It has a few advantages compared to what you were trying to do, but
there are most certainly details that only you know about and that require
further tweaking. For instance, the pattern for price:
\d+(?:\.\d+)?
may or may not be correct in this case. Maybe you'd use Regexp::Common's
patterns for matching numbers instead. In any case, I doubt that just \S+
will give you what you want.
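As a rough illustration of the alternation approach above, here is the same pattern run against an in-memory string instead of FILE. The sample data (URLs, ids, prices) is invented, and the results are collected in an array so they can be inspected. $+ holds the text of the capture group that actually took part in the last successful match, which is why one variable serves all three alternatives.

```perl
use strict;
use warnings;

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';
my $pattern = qr($s1|$s2|$s3);

# Hypothetical input: two records, the first with listing_id appearing twice.
my $data = '<td class="year" rowspan="2">2004</td>'
         . '<a href="view?listing_id=17">a</a><a href="photo?listing_id=17">b</a>'
         . '<td class="price">8995.50</td>'
         . '<td class="year" rowspan="2">2005</td>'
         . '<a href="view?listing_id=42">c</a>'
         . '<td class="price">12000</td>';

# Matches are found left to right, so the values come out in document order.
my @found;
push @found, $+ while $data =~ /$pattern/g;

print join("\t", @found), "\n";
```

Because a single regex walks the string once, the three kinds of values are reported in the order they occur in the input, repeated listing ids included.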
Please use the docs if there are details in the above suggestion that
you don't understand.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl