
HTML parsing
Daniel Smith

2005-03-29, 3:56 pm

Hi all,

I'm brand new to Perl and have just a little programming background. I was tasked with parsing a set of .html files to extract the data contained in some terribly formatted tables. Here is a sample of what I have:

<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>

This is one row from the typical four-row table that is returned as a search result. There are 25 of these four-row tables per page. Could someone point me in the right direction as to how I might go about doing this? A colleague of mine told me to "put the file into an array and use the 'split' command". While I vaguely understand the concept, I'm not sure about the syntax. Can anyone shed some light?

Thanks in advance,

Dan


Felix Geerinckx

2005-03-29, 3:56 pm

On 28/03/2005, Daniel Smith wrote:

> I was tasked with parsing a set of .html files in order to extract
> the data contained within some terribly formatted tables.


[...]

> Can anyone shed some light?


I used HTML::TreeBuilder on a similar project once:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file('yourfile.html') or die "Cannot open file: $!";

# Get tables
my @tables = $tree->look_down( '_tag', 'table' );
for my $t (@tables) {
    # Get rows
    my @rows = $t->look_down( '_tag', 'tr' );
    for my $r (@rows) {
        print "Row contents:\n";
        # Get 'th' and 'td' cells (anchored so only those two tags match)
        my @cells = $r->look_down( '_tag', qr/^t[hd]$/ );
        for my $c (@cells) {
            print "\t", $c->as_text(), "\n";
        }
    }
}
$tree->delete();

--
felix


Offer Kaye

2005-03-29, 3:56 pm

On Mon, 28 Mar 2005 15:49:38 -0500, Daniel Smith wrote:
> [...]
> A colleague of mine told me "put the file into an array and use the
> 'split' command"....while I vaguely understand the concept, I'm not sure about the syntax. Can
> anyone shed some light?


Hi Dan,
I would recommend against using a split or regexp-based approach, as
any such approach is bound to be very fragile when parsing HTML. It is
much better to use a module. Here is one example, using
HTML::TokeParser:
################### begin code
use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser;

my @all_data;    # an array to hold the data
# Parse the HTML
my $parser = HTML::TokeParser->new("input.html")
    or die "Can't open input file input.html: $!";
# Search for a font tag and extract the data.
while ( defined( my $token = $parser->get_tag("font") ) ) {
    my $data = $parser->get_text;    # get the data
    $data =~ s/^\s+//;               # get rid of extra whitespace at
    $data =~ s/\s+$//;               # the beginning and end
    push @all_data, $data;           # save the data
}

print Dumper(\@all_data);
################### end code

This approach assumes that the data always comes after a font tag
(based on your example data). If this isn't the case, the code has to
change, but it is a lot easier to do if you use HTML::TokeParser than
if you do so using split.
If you insist on using split, read "perldoc -f split".
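For reference, here is a minimal sketch of what a split-based approach might look like on the sample row from the original post. The row is inlined as a string so the sketch runs on its own, and the tag pattern is only an assumption for illustration: it goes wrong as soon as an attribute value contains a '>' or the page contains comments or scripts with angle brackets, which is the fragility mentioned above.

```perl
use strict;
use warnings;

# Sample row from the original post, inlined so the sketch is self-contained.
my $html = <<'HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
HTML

# Split on anything that looks like a tag; what remains is the text between tags.
my @pieces = split /<[^>]*>/, $html;

# Trim each piece and keep only the ones that still contain something.
my @data;
for my $piece (@pieces) {
    $piece =~ s/^\s+|\s+$//g;
    push @data, $piece if length $piece;
}

print "$_\n" for @data;
```

On the sample row this leaves four strings in @data, but nothing ties a label to its value, so any change in the page layout silently shifts the results.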

Hope this helps,
--
Offer Kaye


Charles K. Clarkson

2005-03-29, 3:56 pm

Offer Kaye wrote:

: while (defined(my $token = $parser->get_tag("font"))) {
:     my $data = $parser->get_text; # get the data
:     $data =~ s/^\s+//; # get rid of extra whitespace at
:     $data =~ s/\s+$//; # the beginning and end
:     push @all_data,$data; # save the data
: }

I am a big fan of HTML::TokeParser. You can avoid the
regular expressions above by using the get_trimmed_text()
method.

while ( defined( my $token = $parser->get_tag('font') ) ) {
    push @all_data, $parser->get_trimmed_text();
}

Technically, we don't need $token, but it is often
useful to have it on hand.

while ( $parser->get_tag('font') ) {
    push @all_data, $parser->get_trimmed_text();
}
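If the data does not always sit inside a font tag, one possible variation is to key on the th and td cells instead. The sketch below assumes each th label is followed by the td holding its value, as in the sample row. The inlined $html string and the %field hash are illustrative names and not part of the code above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample row from the original post, inlined so the sketch is self-contained.
my $html = <<'HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
HTML

# new() accepts a reference to a string as well as a file name.
my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my %field;    # label text => value text (hypothetical structure)
my $label;    # text of the most recent <th>, waiting for its <td>

# With several tag names, get_tag returns the next start tag matching any of them.
while ( my $token = $parser->get_tag( 'th', 'td' ) ) {
    my $tag = $token->[0];

    # Collect all text up to the matching end tag, skipping the <font> wrapper.
    my $text = $parser->get_trimmed_text("/$tag");

    if ( $tag eq 'th' ) {
        $label = $text;
    }
    elsif ( defined $label ) {
        $field{$label} = $text;
        undef $label;
    }
}

print "$_ => $field{$_}\n" for sort keys %field;
```

Passing the end tag to get_trimmed_text() matters here: without an argument it only returns text when the very next token is text, and in this markup the next token after each th or td is the font tag.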


HTH,

Charles K. Clarkson



FreeFall

2005-03-30, 3:56 pm

I am new to Perl too, and had a go at it:

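#!/usr/bin/perl
use warnings;
use strict;

my @data;
while (<>) {
    chomp;
    # Keep only lines that do not start with a tag. Note that this
    # misses text that shares a line with its tags.
    push @data, $_ if !/^</;
}

print "$_\n" for @data;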





--
Whatever you do will be insignificant, but
the important thing is that you do it!