count how many tags (PERL CGI Beginners, August 2005)
| Adriano Allora 2005-08-31, 6:55 pm |
Dear all,
I couldn't work out how to use the HTML modules, but I need to count how
many tags of several types appear in a web page, so I wrote this
script.
Can someone tell me why it doesn't work?
%tags = ("paragraph" => "p",
         "list_o" => "ol",
         "list_no" => "ul",
         "title" => "h1",
         "ltl_title" => "h2|3|4|5",
         "link" => "href");
while(<$filename>)
{
    foreach $var (keys(%tags))
    {
        $$var += count_it($$var, $var, $tags{$var},$_);
    }
}
foreach $var (keys(%tags))
{
    print STDOUT "$var: $$var $br\n";
}
sub countit
{
    $actual = shift();
    $descrizione = shift();
    $tag = shift();
    LOOP: if(/<$tag[^>]*>/)
    {
        $actual++;
        s/<\/?$tag[^>]*>//;
        goto LOOP;
    }
    return $actual;
}
http://www.e-allora.net
| Adriano Allora 2005-08-31, 6:55 pm |
Well, the name of the sub is correct in my script: I translated the
script for the e-mail and introduced that error only there.
alladr
>
> while(<$filename> )
> {
> foreach $var (keys(%tags))
> {
> $$var += count_it($$var, $var, $tags{$var},$_);
> }
> }
> foreach $var(keys(%tags))
> {
> print STDOUT "$var: $$var $br\n";
> }
>
>
>
> sub count_it
[...]
alladr
|
| --- Adriano Allora <all.adr@e-allora.net> wrote:
> I didn't understand how to use the module HTML, but I need to count
> how
> many tags of several types appear in a web page and so I wrote this
> script.
>
> Someone can tell me why this one doesn't work?
>
> %tags = ("paragraph" => "p",
> "list_o" => "ol",
> "list_no" => "ul",
> "title" => "h1",
> "ltl_title" => "h2|3|4|5",
> "link" => "href");
>
First, I would suggest that you're trying to count two different
things: tags and attributes (href is an attribute of the <a> tag, not a
tag itself). You may wish to separate them. The
following code will do what you want. It uses the
HTML::TokeParser::Simple module to make this relatively easy to read.
Whether or not the data structures are the best way to handle this is
another story.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;

# Parse the sample HTML that follows the __DATA__ marker.
my $parser = HTML::TokeParser::Simple->new( handle => \*DATA );

# Tags to count: a label mapped to a tag name (string or regex) and a counter.
my %tag_for = (
    "paragraph" => { name => "p",          count => 0 },
    "list_o"    => { name => "ol",         count => 0 },
    "list_no"   => { name => "ul",         count => 0 },
    "title"     => { name => "h1",         count => 0 },
    "ltl_title" => { name => qr/h[2345]/,  count => 0 },
);

# Attributes are counted separately from tags.
my %attribute_for = ( "link" => { name => "href", count => 0 } );

# get_tag returns only tag tokens and skips text and comments.
while ( my $token = $parser->get_tag ) {
    foreach my $tag ( keys %tag_for ) {
        if ( $token->is_start_tag( $tag_for{$tag}{name} ) ) {
            $tag_for{$tag}{count}++;
            last;
        }
    }
    foreach my $attribute ( keys %attribute_for ) {
        if ( $token->get_attr( $attribute_for{$attribute}{name} ) ) {
            $attribute_for{$attribute}{count}++;
            last;
        }
    }
}

# Report the tag counts, then the attribute counts.
foreach my $type ( keys %tag_for ) {
    printf "%10s %3d\n", $type, $tag_for{$type}{count};
}
print "\n";
foreach my $type ( keys %attribute_for ) {
    printf "%10s %3d\n", $type, $attribute_for{$type}{count};
}
__DATA__
<head></head>
<body>
<h1>title</h1>
<p>One P tag</p>
<ul>
<li>item</li>
</ul>
<h2>Little title 1</h2>
<h2>Little title 2</h2>
<h3>Little title 3</h3>
<a href="foo.html">asdf</a>
</body>
And the output (the order within each group may vary, because Perl hash
keys are unordered):
    list_o   0
   list_no   1
     title   1
 ltl_title   3
 paragraph   1

      link   1
Cheers,
Ovid
--
If this message is a response to a question on a mailing list, please send
follow up questions to the list.
Web Programming with Perl -- http://users.easystreet.com/ovid/cgi_course/
| |
|
|
--- Ovid <publiustemp-beginnerscgi2@yahoo.com> wrote:
> First, I would suggest that you're trying to count two different
> things, tags and attributes. You may wish to separate them. The
> following code will do what you want. It uses the
> HTML::TokeParser::Simple module to make this relatively easy to read.
I should also have mentioned that I duplicated some code in there to
make it a relatively straightforward read. You'll probably want to
abstract the duplicated bits out into subroutines.
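One way that refactoring might look is sketched below. This is only an illustration, not the code from the post above: the helper names count_first_match and report are hypothetical, the sample HTML is passed in as a string instead of being read from __DATA__, the heading regex is anchored, and the keys are sorted so the output order is stable. It assumes HTML::TokeParser::Simple 3.13 or later is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;

# Hypothetical helper (not in the original post): increment the counter of
# the first entry in %$table whose name satisfies $matches->($token, $name).
sub count_first_match {
    my ( $token, $table, $matches ) = @_;
    foreach my $type ( keys %$table ) {
        if ( $matches->( $token, $table->{$type}{name} ) ) {
            $table->{$type}{count}++;
            return 1;
        }
    }
    return 0;
}

# Hypothetical helper: print one "label count" line per entry, sorted by label.
sub report {
    my ($table) = @_;
    foreach my $type ( sort keys %$table ) {
        printf "%10s %3d\n", $type, $table->{$type}{count};
    }
}

# Sample input, made up for this sketch.
my $html = <<'END_HTML';
<body>
<h1>title</h1>
<p>One P tag</p>
<ul><li>item</li></ul>
<h2>Little title 1</h2>
<h3>Little title 2</h3>
<a href="foo.html">asdf</a>
</body>
END_HTML

my %tag_for = (
    "paragraph" => { name => "p",            count => 0 },
    "list_o"    => { name => "ol",           count => 0 },
    "list_no"   => { name => "ul",           count => 0 },
    "title"     => { name => "h1",           count => 0 },
    "ltl_title" => { name => qr/^h[2345]$/,  count => 0 },
);
my %attribute_for = ( "link" => { name => "href", count => 0 } );

my $parser = HTML::TokeParser::Simple->new( string => $html );
while ( my $token = $parser->get_tag ) {
    # Same two checks as before, each expressed as a small callback.
    count_first_match( $token, \%tag_for,
        sub { $_[0]->is_start_tag( $_[1] ) } );
    count_first_match( $token, \%attribute_for,
        sub { defined $_[0]->get_attr( $_[1] ) } );
}

report( \%tag_for );
print "\n";
report( \%attribute_for );
```

Passing the test as a code reference keeps the loop over the table in one place, so adding a third kind of thing to count only needs another table and another callback.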
Cheers,
Ovid