HTML::TokeParser (PERL Programming archive, October 2005)
| Hi,
I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.
The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"
Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {"
The HTML looks like this:
=======================================
<td colspan="2"> </td>
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
</td>
</tr>
---------------------------------------------
My script looks like this:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
my $content =
    get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" ) or die $!;
my $stream = HTML::TokeParser->new( \$content ) or die $!;
my ($tag, $headline, $url);
while ( $tag = $stream->get_tag("a") ) {
    if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {
        $url = $tag->[2]{href} || "--";
        $headline = $stream->get_trimmed_text('/a');
        print $url;
        print $headline;
    }
}
-----------------------------------------------------------
I think the problem lies in the ordering of tags, but that's as far as I've
got with working out what's wrong.
| |
| Stephen Hildrey 2005-10-16, 6:55 pm |
| DVH wrote:
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
>
> The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> at eurofeed.pl line 31"
>
> Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
You probably want ->[1] rather than ->[2]
Regards,
Steve
--
Stephen Hildrey
E-mail: steve@uptime.org.uk / Tel: +442071931337
Jabber: steve@jabber.earth.li / MSN: foo@hotmail.co.uk
| |
| it_says_BALLS_on_your forehead 2005-10-16, 6:55 pm |
|
DVH wrote:
> Hi,
>
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
after searching on CPAN for HTML::TokeParser, and looking at the
$p->get_tag( @tags ) method,
it looks like:
The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:
[$tag, $attr, $attrseq, $text]
The tagname of end tags are prefixed with "/", i.e. end tag is returned
like this:
["/$tag", $text]
...so you get an array reference back. why are you adding {class} into
your code?
| |
| it_says_BALLS_on_your forehead 2005-10-16, 6:55 pm |
|
it_says_BALLS_on_your forehead wrote:
> DVH wrote:
>
> after searching on CPAN for HTML::TokeParser, and looking at the
> $p->get_tag( @tags ) method,
> it looks like:
>
> The tag information is returned as an array reference in the same form
> as for $p->get_token above, but the type code (first element) is
> missing. A start tag will be returned like this:
>
> [$tag, $attr, $attrseq, $text]
> The tagname of end tags are prefixed with "/", i.e. end tag is returned
> like this:
>
> ["/$tag", $text]
>
> ...so you get an array reference back. why are you adding {class} into
> your code?
ahh, my mistake...
use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");
while (my $token = $p->get_tag("a")) {
my $url = $token->[1]{href} || "-";
my $text = $p->get_trimmed_text("/a");
print "$url\t$text\n";
}
...yeah, you need to look at index 1, not index 2.
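To make the index explicit, here is a minimal sketch of the class filter from the original script using index 1. It runs against the HTML fragment from the original post rather than the live site (an assumption, so it works offline); the opening <tr> and the second link are added purely for illustration, and the variable names are not from anyone's real script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup based on the original post; a stand-in for the live page.
# The second <a> is added here only to show that it gets skipped.
my $html = <<'HTML';
<tr>
<td colspan="2"> </td>
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
<a href="other.html">Not a title link</a>
</td>
</tr>
HTML

my $stream = HTML::TokeParser->new( \$html )
    or die "Could not create parser\n";

my @found;
while ( my $tag = $stream->get_tag('a') ) {
    # $tag is [ $tagname, \%attr, \@attrseq, $text ];
    # the attribute hash is at index 1, the attribute order array at index 2.
    my $attr = $tag->[1];
    next unless defined $attr->{class}
        and $attr->{class} eq 'docSel-titleLink';

    my $url      = $attr->{href} || '--';
    my $headline = $stream->get_trimmed_text('/a');
    push @found, [ $url, $headline ];
    print "$url\t$headline\n";
}
```

Index 2 holds an array reference (the attribute names in their original order), which is why treating it as a hash produced the "Can't coerce array into hash" error.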
| |
|
|
it_says_BALLS_on_your forehead <simon.chao@fmr.com> wrote in message
news:1129485772.266262.220750@g43g2000cwa.googlegroups.com...
>
> it_says_BALLS_on_your forehead wrote:
>
> ahh, my mistake...
> use HTML::TokeParser;
> $p = HTML::TokeParser->new(shift||"index.html");
>
> while (my $token = $p->get_tag("a")) {
> my $url = $token->[1]{href} || "-";
> my $text = $p->get_trimmed_text("/a");
> print "$url\t$text\n";
> }
>
> ...yeah, you need to look at index 1, not index 2.
>
Thanks. It works with [1].
| |
|
|
Stephen Hildrey <steve@uptime.org.uk> wrote in message
news:1129484153.30203.0@doris.uk.clara.net...
> DVH wrote:
>
> You probably want ->[1] rather than ->[2]
I did. I had thought it would be tag[2] because I was looking for the third
tag within those brackets, but obviously not.
Thank you, that now works. I have a couple more questions (ah they always
do...)
Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
there a reasonably simple way of getting rid of that? The site is at
http://europa.eu.int/rapid/recentPr...nguage=en&hits=10
if you need to see it.
Secondly, I'm working towards following those hrefs and then parsing
the text I find there. Would I be better off using WWW::Mechanize to do
this?
Thanks again for your help.
| |
| A. Sinan Unur 2005-10-16, 6:55 pm |
| "DVH" <dvh@dvhdvhdvhdvdh.dvh> wrote in
news:diug96$jfj$1@nwrdmz02.dmz.ncs.ea.ibs-infra.bt.com:
>
> Stephen Hildrey <steve@uptime.org.uk> wrote in message
> news:1129484153.30203.0@doris.uk.clara.net...
>
> I did. I had thought it would be tag[2] because I was looking for the
> third tag within those brackets, but obviously not.
>
> Thank you, that now works. I have a couple more questions (ah they
> always do...)
>
> Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.
ITYM "the HTML contains".
> Is there a reasonably simple way of getting rid of that? The site is
> at http://europa.eu.int/rapid/recentPr...easesAction.do?guiLanguage=en&hits=10
> if you need to see it.
>
> Secondly, I'm working towards following those hrefs and then
> parsing the text I find there. Would I be better off using
> WWW::Mechanize to do this?
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::LinkExtractor;
use LWP::Simple;
my $url = q{http://europa.eu.int/rapid/recentPr...easesAction.do?guiLanguage=en};
my $html = get $url;
die "Cannot get <$url>\n" unless $html;
my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);
# Dump every link whose class attribute matches; links without a
# class are skipped so that 'eq' never sees an undefined value.
for my $link ( @{ $lx->links } ) {
    if (defined $link->{class} and $link->{class} eq 'docSel-formatLink') {
        print Dumper $link;
    }
}
__END__
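On the whitespace question, one possible approach is simply to strip whitespace from the href value before using it. This is only a sketch and has not been checked against the live page: it assumes whitespace is never meaningful inside these URLs, `clean_href` is a hypothetical helper name, and the sample value is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove all whitespace (spaces, tabs, newlines)
# from an href value. Assumes whitespace is never significant in the URL.
sub clean_href {
    my ($href) = @_;
    return unless defined $href;
    ( my $clean = $href ) =~ s/\s+//g;
    return $clean;
}

# Made-up example of an href broken up by whitespace in the page source.
my $raw     = "pressReleasesAction.do?\n    reference=EPSO/05/06 ";
my $cleaned = clean_href($raw);
print "$cleaned\n";
```

The same substitution could be applied to `$link->{href}` inside the loop above, or to the href taken from an HTML::TokeParser tag.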
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
| |
|
|
A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in message
news:Xns96F1B3F245A6asu1cornelledu@127.0.0.1...
Sorry for getting back to you three days late, but thanks to both of you.
| |
| A. Sinan Unur 2005-10-19, 6:56 pm |
| "DVH" <dvh@dvhdvhdvhdvdh.dvh> wrote in
news:dj6a0n$7a8$1@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:
> A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in message
> news:Xns96F1B3F245A6asu1cornelledu@127.0.0.1...
...
> Sorry for getting back to you three days late, but thanks to both
> of you.
You are welcome. Hope it helped.
Sinan