PERL CGI Beginners, August 2005

Subject: count how many tags
Adriano Allora

2005-08-31, 6:55 pm

Dear all,

I didn't understand how to use the HTML module, but I need to count how
many tags of several types appear in a web page, so I wrote this
script.

Can someone tell me why it doesn't work?


%tags = ("paragraph" => "p",
         "list_o" => "ol",
         "list_no" => "ul",
         "title" => "h1",
         "ltl_title" => "h2|3|4|5",
         "link" => "href");

while(<$filename>)
{
    foreach $var (keys(%tags))
    {
        $$var += count_it($$var, $var, $tags{$var},$_);
    }
}
foreach $var(keys(%tags))
{
    print STDOUT "$var: $$var $br\n";
}

sub countit
{
    $actual = shift();
    $descrizione = shift();
    $tag= shift();
    LOOP: if(/<$tag[^>]*>/)
    {
        $actual++;
        s/<\/?$tag[^>]*>//;
        goto LOOP;
    }
    return $actual;
}


http://www.e-allora.net

Adriano Allora

2005-08-31, 6:55 pm

Well, the name of the sub is correct in my script. I translated the
names for the e-mail and introduced the error only there.

alladr


>
> while(<$filename> )
> {
> foreach $var (keys(%tags))
> {
> $$var += count_it($$var, $var, $tags{$var},$_);
> }
> }
> foreach $var(keys(%tags))
> {
> print STDOUT "$var: $$var $br\n";
> }
>
>
>
> sub count_it


[...]



Ovid

2005-08-31, 6:55 pm

--- Adriano Allora <all.adr@e-allora.net> wrote:

> I didn't understand how to use the module HTML, but I need to count
> how
> many tags of several types appear in a web page and so I wrote this
> script.
>
> Someone can tell me why this one doesn't work?
>
> %tags = ("paragraph" => "p",
> "list_o" => "ol",
> "list_no" => "ul",
> "title" => "h1",
> "ltl_title" => "h2|3|4|5",
> "link" => "href");
>


First, I would suggest that you're trying to count two different
things, tags and attributes. You may wish to separate them. The
following code will do what you want. It uses the
HTML::TokeParser::Simple module to make this relatively easy to read.
Whether or not the data structures are the best way to handle this is
another story.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;

my $parser = HTML::TokeParser::Simple->new( handle => \*DATA );

my %tag_for = (
    "paragraph" => { name => "p",          count => 0 },
    "list_o"    => { name => "ol",         count => 0 },
    "list_no"   => { name => "ul",         count => 0 },
    "title"     => { name => "h1",         count => 0 },
    "ltl_title" => { name => qr/h[2345]/,  count => 0 },
);

my %attribute_for = ( "link" => { name => "href", count => 0 } );

while ( my $token = $parser->get_tag ) {
    foreach my $tag ( keys %tag_for ) {
        if ( $token->is_start_tag( $tag_for{$tag}{name} ) ) {
            $tag_for{$tag}{count}++;
            last;
        }
    }
    foreach my $attribute ( keys %attribute_for ) {
        if ( $token->get_attr( $attribute_for{$attribute}{name} ) ) {
            $attribute_for{$attribute}{count}++;
            last;
        }
    }
}

foreach my $type ( keys %tag_for ) {
    printf "%10s %3d\n", $type, $tag_for{$type}{count};
}
print "\n";
foreach my $type ( keys %attribute_for ) {
    printf "%10s %3d\n", $type, $attribute_for{$type}{count};
}
__DATA__
<head></head>
<body>
<h1>title</h1>
<p>One P tag</p>
<ul>
<li>item</li>
</ul>
<h2>Little title 1</h2>
<h2>Little title 2</h2>
<h3>Little title 3</h3>
<a href="foo.html">asdf</a>
</body>

And the output:

    list_o   0
   list_no   1
     title   1
 ltl_title   3
 paragraph   1

      link   1
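
For comparison, here is a rough sketch of how the regex-based counting from the original question might be written under "use strict", for cases where installing a parser module is not an option. It is only an approximation: a regex is not an HTML parser and will miscount on comments, scripts, or a ">" inside an attribute value. The names %name_for, count_tags and count_links are made up for this illustration, and the sample page is the one from the script above. Note that "h2|3|4|5" placed after "<" in a pattern is read as "<h2" OR "3" OR "4" OR "5...", so the sketch uses a character class instead, and it counts href separately because href is an attribute, not a tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Label => pattern for the tag NAME only (hypothetical table for this sketch).
my %name_for = (
    paragraph => 'p',
    list_o    => 'ol',
    list_no   => 'ul',
    title     => 'h1',
    ltl_title => 'h[2-5]',
);

# Count start tags per label in one string holding the whole page.
sub count_tags {
    my ( $html, $name_for ) = @_;
    my %count;
    foreach my $label ( keys %$name_for ) {
        my $name = $name_for->{$label};

        # The lookahead keeps "p" from also matching <pre> or <param>.
        # "() =" forces list context, so the scalar receives the match count.
        $count{$label} = () = $html =~ /<(?:$name)(?=[\s\/>])[^>]*>/gi;
    }
    return \%count;
}

# Count <a ...> start tags that carry an href attribute.
sub count_links {
    my ($html) = @_;
    my $n = () = $html =~ /<a\s[^>]*\bhref\s*=/gi;
    return $n;
}

my $html = <<'END_HTML';
<head></head>
<body>
<h1>title</h1>
<p>One P tag</p>
<ul>
<li>item</li>
</ul>
<h2>Little title 1</h2>
<h2>Little title 2</h2>
<h3>Little title 3</h3>
<a href="foo.html">asdf</a>
</body>
END_HTML

my $count_for = count_tags( $html, \%name_for );
foreach my $label ( sort keys %$count_for ) {
    printf "%10s %3d\n", $label, $count_for->{$label};
}
printf "\n%10s %3d\n", 'link', count_links($html);
```

The page is read into a single string rather than line by line, since a tag may be split across lines. Nothing is removed from the input and no counter is passed in and added back, so each tag is counted once.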

Cheers,
Ovid

--
If this message is a response to a question on a mailing list, please send
follow up questions to the list.

Web Programming with Perl -- http://users.easystreet.com/ovid/cgi_course/
Ovid

2005-08-31, 6:55 pm



--- Ovid <publiustemp-beginnerscgi2@yahoo.com> wrote:

> First, I would suggest that you're trying to count two different
> things, tags and attributes. You may wish to separate them. The
> following code will do what you want. It uses the
> HTML::TokeParser::Simple module to make this relatively easy to read.


I should also have mentioned that I duplicated some code in there to
make it a relatively straightforward read. You'll probably want to
abstract the duplicated bits out into subroutines.
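
One possible way to factor out the duplication is sketched below; it is an illustration, not necessarily how the author would do it. The subroutine names count_first_match and report_lines are invented here. The sketch also assumes that reading the sample page through an in-memory filehandle is acceptable in place of DATA, and it sorts the labels so the report order is stable, which the original script does not do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;

# Increment the first entry of %$spec_for whose {name} satisfies $matches,
# a code reference that receives the name (string or qr// pattern).
sub count_first_match {
    my ( $spec_for, $matches ) = @_;
    foreach my $type ( keys %$spec_for ) {
        if ( $matches->( $spec_for->{$type}{name} ) ) {
            $spec_for->{$type}{count}++;
            last;
        }
    }
    return;
}

# Build one formatted "label count" line per entry, sorted by label.
sub report_lines {
    my ($spec_for) = @_;
    return map { sprintf "%10s %3d\n", $_, $spec_for->{$_}{count} }
        sort keys %$spec_for;
}

my %tag_for = (
    paragraph => { name => "p",         count => 0 },
    list_o    => { name => "ol",        count => 0 },
    list_no   => { name => "ul",        count => 0 },
    title     => { name => "h1",        count => 0 },
    ltl_title => { name => qr/h[2345]/, count => 0 },
);
my %attribute_for = ( link => { name => "href", count => 0 } );

my $html = <<'END_HTML';
<head></head>
<body>
<h1>title</h1>
<p>One P tag</p>
<ul>
<li>item</li>
</ul>
<h2>Little title 1</h2>
<h2>Little title 2</h2>
<h3>Little title 3</h3>
<a href="foo.html">asdf</a>
</body>
END_HTML

# In-memory filehandle, so the same "handle =>" constructor form is used.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $parser = HTML::TokeParser::Simple->new( handle => $fh );

while ( my $token = $parser->get_tag ) {
    count_first_match( \%tag_for,       sub { $token->is_start_tag( $_[0] ) } );
    count_first_match( \%attribute_for, sub { $token->get_attr( $_[0] ) } );
}

print report_lines( \%tag_for ), "\n", report_lines( \%attribute_for );
```

Passing the test as a code reference keeps the loop body identical for tags and attributes; only the question asked of each token differs.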

Cheers,
Ovid
