parsing HTML (PERL Beginners, July 2004)

Andrew Gaffney

2004-07-22, 8:55 am

I am trying to build an HTML editor for use with my HTML::Mason site. I intend
for it to support nested tables, SPANs, and anchors. I am looking for a module
that can help me parse existing HTML (custom or generated by my scripts) into a
tree structure similar to:

my $html = [
    { tag => 'table', id => 'maintable', width => 300, content => [
        { tag => 'tr', content => [
            { tag => 'td', width => 200, content => "some content" },
            { tag => 'td', width => 100, content => "more content" },
        ] },
    ] },
]; # Not tested, but you get the idea

which would correspond to the following HTML:

<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>

Once I have the data in the tree, I can easily modify it and transform it back
into HTML. Is there a module that can help make this easier or should I go about
this differently?

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
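
For the "transform it back into HTML" step, a minimal sketch of a recursive serializer over a structure of this shape might look like the following. The names escape_html and to_html are hypothetical helpers, not part of any module mentioned in this thread. The sketch assumes that every key other than tag and content is an attribute, and that content is either a plain string or an array of nodes and strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are unsafe in HTML text
# and in double-quoted attribute values.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical serializer: a node is either a plain string (text) or a
# hash with a 'tag' key, an optional 'content' key, and attributes stored
# under every other key.
sub to_html {
    my ($node) = @_;
    return escape_html($node) unless ref $node;

    # Attributes are emitted in sorted order so the output is stable.
    my $attrs = join '',
        map  { sprintf ' %s="%s"', $_, escape_html($node->{$_}) }
        grep { $_ ne 'tag' && $_ ne 'content' }
        sort keys %$node;

    my $content = $node->{content};
    $content = []         unless defined $content;
    $content = [$content] unless ref $content;    # allow a bare string

    my $inner = join '', map { to_html($_) } @$content;
    return "<$node->{tag}$attrs>$inner</$node->{tag}>";
}

my $html = [
    { tag => 'table', id => 'maintable', width => 300, content => [
        { tag => 'tr', content => [
            { tag => 'td', width => 200, content => "some content" },
            { tag => 'td', width => 100, content => "more content" },
        ] },
    ] },
];

$html->[0]{width} = 400;    # modify the tree before writing it back out

my $out = join '', map { to_html($_) } @$html;
print "$out\n";
```

The sketch always writes an explicit end tag, so elements such as BR or IMG would need special handling, and because attributes share a hash with tag and content, an attribute that happens to be named "content" cannot be represented.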

Randy W. Sims

2004-07-22, 8:55 am

On 7/21/2004 10:42 PM, Andrew Gaffney wrote:

> [snip]



HTML::Parser doesn't build a tree, but you can use it to build one if
necessary. However, you might find that building a tree is not necessary,
and working directly from the parser's events is less memory intensive.

Then there is HTML::Tree.

Regards,
Randy.


Andrew Gaffney

2004-07-22, 8:55 am

Randy W. Sims wrote:
> On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
>
>
> HTML::Parser doesn't build a tree, but you can use it to build one if
> necessary. However, you might find building a tree is not necessary.
> And this is less memory intensive.
>
> Then there is HTML::Tree.


I'd rather generate a structure similar to what I have above instead of having a
large tree of class objects that takes up more RAM and is probably slower. How
would I go about generating a structure such as that above using HTML::Parser?


Randy W. Sims

2004-07-22, 8:55 am

On 7/21/2004 11:24 PM, Andrew Gaffney wrote:

> Randy W. Sims wrote:
>

[snip]
> I'd rather generate a structure similar to what I have above instead of
> having a large tree of class objects that takes up more RAM and is
> probably slower. How would I go about generating a structure such as
> that above using HTML::Parser?


Parsers like HTML::Parser scan a document and fire off events when they
encounter certain tokens. In the case of HTML::Parser, events are fired
for a start tag, for the text between tags, and for an end tag. If you
have an arbitrarily deep document structure like HTML, you can keep track
of the current nesting using a stack:

#!/usr/bin/perl
package SampleParser;

use strict;
use warnings;

use HTML::Parser;
use base qw(HTML::Parser);

# Called for each start tag: print it at the current depth, then descend.
sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    my $stack = $self->{_stack};
    my $depth = $stack ? @$stack : 0;
    print ' ' x $depth, "<$tagname>\n";
    push @{$self->{_stack}}, ' ';
}

# Called for each end tag: ascend one level, then print the closing tag.
sub end {
    my($self, $tagname, $origtext) = @_;
    pop @{$self->{_stack}};
    my $stack = $self->{_stack};
    my $depth = $stack ? @$stack : 0;
    print ' ' x $depth, "</$tagname>\n";
}

1;

package main;

use strict;
use warnings;

my $p = SampleParser->new();
$p->parse_file(\*DATA);

__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
The body.
</body>
</html>


Andrew Gaffney

2004-07-22, 8:55 am

Randy W. Sims wrote:
> On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
>
>
> [snip]
>
>
>


Thanks. In the time it took you to put that together, I came up with the
following to figure out how HTML::Parser works. I'll use your code to expand
upon it.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser ();

sub start {
    print "start ";
    foreach my $arg (@_) {
        if(ref($arg) eq 'HASH') {
            foreach my $key(keys %{$arg}) {
                print " $key - $arg->{$key}\n";
            }
        } else {
            print "$arg\n";
        }
    }
}

sub end {
    print "end ";
    foreach(@_) {
        print "$_\n";
    }
}

sub text {
    my $text = shift;

    chomp $text;
    print " text - '$text'\n" if($text ne '');
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h => [\&end, "tagname"],
                           text_h => [\&text, "dtext"],
                           marked_sections => 1 ); # Not sure what this does

$p->parse_file("test.html");

The above gives me the expected output for the sample HTML I provided before.


Andrew Gaffney

2004-07-22, 8:55 am

Andrew Gaffney wrote:
> Randy W. Sims wrote:
>

<SNIP>
> Thanks. In the time it took you to put that together, I came up with the
> following to figure out how HTML::Parser works. I'll use your code to
> expand upon it.


<SNIP>

Here is my current working code. Please take a look at it and see if there are
any obvious (or not so obvious) problems. I thought this would end up being far
more difficult.

parsehtml.pl
============
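#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser ();

# Root of the tree: a pseudo-element that holds the whole document.
my $htmltree = [ { tag => 'document', content => [] } ];
# $node always points at the content array of the element being filled.
my $node = $htmltree->[0]->{content};
# Stack of content arrays to return to when the current element closes.
my @prevnodes = ($htmltree);

# Start tag: create a node, attach it to the current content, descend into it.
sub start {
    my $tagname = shift;
    my $attr = shift;
    my $newnode = {};

    $newnode->{tag} = $tagname;
    foreach my $key(keys %{$attr}) {
        $newnode->{$key} = $attr->{$key};
    }
    $newnode->{content} = [];
    push @prevnodes, $node;
    push @{$node}, $newnode;
    $node = $newnode->{content};
}

# End tag: go back to the content array of the enclosing element.
sub end {
    my $tagname = shift;

    $node = pop @prevnodes;
}

# Text: store it as a plain string in the current content array.
sub text {
    my $text = shift;

    chomp $text;
    if($text ne '') {
        push @{$node}, $text;
    }
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h => [\&end, "tagname"],
                           text_h => [\&text, "dtext"] );

$p->parse_file("test.html");

use Data::Dumper;
print Dumper $htmltree;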

test.html
=========
<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>
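
One thing that may matter once the input is indented: chomp removes only a single trailing newline, so a text event made up of a newline followed by indentation would still be stored as a text node. A minimal sketch of a stricter text handler follows. Here @content is a stand-in for the array that $node points at, and the handler is called by hand with simulated text events instead of being driven by HTML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @content;    # stand-in for the array that $node points at

# Hypothetical replacement for the text handler: skip text that is only
# whitespace, and trim leading and trailing whitespace from the rest.
sub text {
    my ($text) = @_;
    return unless $text =~ /\S/;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    push @content, $text;
}

# Simulated text events, roughly what indented HTML would produce.
text($_) for "\n", "\n    ", "some content", "  more content\n";

print join('|', @content), "\n";    # prints "some content|more content"
```

Trimming like this would change the meaning of whitespace-sensitive elements such as PRE, so a real editor would probably want to make it conditional on the enclosing tag.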


Randy W. Sims

2004-07-22, 8:56 pm

Andrew Gaffney wrote:
> Here is my current working code. Please take a look at it and see if
> there are any obvious (or not so obvious) problems. I thought this would
> end up being far more difficult.


<snip code>

Looks good to me. Once you get used to the idea of event-based parsing
and of storing context information on a stack, it's really simple, and
even fun. Another nice thing is that once you've mastered one
(HTML::Parser), you've mastered them all (Pod::Parser, XML::Parser, etc.).

Regards,
Randy.
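
The stack approach assumes that every start tag has a matching end tag, which hand-written HTML often does not. As a rough sketch of one way to tolerate that, the end handler below ignores an end tag that matches nothing open, and otherwise closes everything up to and including the matching element. The events are simulated with a hand-written list (the head element is never closed and there is a stray closing p), and the names @open, @log, and @events are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @open;    # names of the elements that are currently open
my @log;     # indented outline, collected here and printed at the end

sub start {
    my ($tagname) = @_;
    my $indent = '  ' x scalar @open;
    push @log, "$indent<$tagname>";
    push @open, $tagname;
}

sub end {
    my ($tagname) = @_;

    # Ignore an end tag that matches nothing currently open.
    return unless grep { $_ eq $tagname } @open;

    # Close everything up to and including the matching element.
    while (@open) {
        my $closed = pop @open;
        last if $closed eq $tagname;
    }
    my $indent = '  ' x scalar @open;
    push @log, "$indent</$tagname>";
}

# Simulated events: <head> is never closed, and </p> was never opened.
my @events = (
    [ start => 'html' ], [ start => 'head' ], [ start => 'title' ],
    [ end => 'title' ],  [ start => 'body' ], [ end => 'p' ],
    [ end => 'body' ],   [ end => 'html' ],
);

for my $event (@events) {
    my ($type, $name) = @$event;
    if ($type eq 'start') { start($name) }
    else                  { end($name) }
}

print "$_\n" for @log;
print "still open: ", scalar(@open), "\n";
```

This only keeps the stack from drifting. It does not implement HTML's implied end tags, so in this input body is still treated as a child of the unclosed head.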