HTML parsing (PERL Beginners archive, March 2005)
| Daniel Smith 2005-03-29, 3:56 pm |
| Hi all,
I'm brand new to Perl, and have just a little programming background. I was tasked with parsing a set of .html files in order to extract the data contained within some terribly formatted tables. Here is a sample of what I have.....
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
This is one row from the typical four row table that is returned as a search result. There are 25 of these four row tables per page. Could someone point me in the right direction as to how I might go about doing this? A colleague of mine told me "put the file into an array and use the 'split' command"....while I vaguely understand the concept, I'm not sure about the syntax. Can anyone shed some light?
Thanks in advance,
Dan
| |
| Felix Geerinckx 2005-03-29, 3:56 pm |
| On 28/03/2005, Daniel Smith wrote:
> I was tasked with parsing a set of .html files in order to extract
> the data contained within some terribly formatted tables.
[...]
> Can anyone shed some light?
I used HTML::TreeBuilder on a similar project once:
#! /usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('yourfile.html') or die "Cannot open file: $!";
# Get tables
my @tables = $tree->look_down( '_tag', 'table' );
for my $t (@tables) {
    # Get rows
    my @rows = $t->look_down( '_tag', 'tr' );
    for my $r (@rows) {
        print "Row contents:\n";
        # Get 'th' and 'td' cells (anchored so only exactly these tags match)
        my @cells = $r->look_down( '_tag', qr/^(?:th|td)$/ );
        for my $c (@cells) {
            print "\t", $c->as_text(), "\n";
        }
    }
}
$tree->delete();
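As a possible extension of the loop above, the cells of each row can be collected into an array and printed as one tab-separated line, which may be easier to load into a spreadsheet. This is only a sketch: it parses an inline copy of the sample row (wrapped in a <table> for the purpose) instead of a file, and the names $html and @records are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Inline copy of the sample row, wrapped in a table (an assumption;
# the real pages would be read with parse_file instead).
my $html = <<'END_HTML';
<table>
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @records;    # one array ref of cell texts per row
for my $row ( $tree->look_down( _tag => 'tr' ) ) {
    my @cells = $row->look_down( _tag => qr/^(?:th|td)$/ );

    # as_trimmed_text strips leading and trailing whitespace
    push @records, [ map { $_->as_trimmed_text } @cells ];
}
$tree->delete;

# One line per row, cells separated by tabs
print join( "\t", @$_ ), "\n" for @records;
```

Each entry of @records holds the four cell texts of one row, in document order.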
--
felix
| |
| Offer Kaye 2005-03-29, 3:56 pm |
| On Mon, 28 Mar 2005 15:49:38 -0500, Daniel Smith wrote:
> I was tasked with parsing a set of .html files in order to extract
> the data contained within some terribly formatted tables.
> [...]
> A colleague of mine told me "put the file into an array and use the
> 'split' command"....while I vaguely understand the concept, I'm not
> sure about the syntax. Can anyone shed some light?
Hi Dan,
I would recommend against using a split or regexp based approach, as
any such approach is bound to be very fragile when parsing HTML. It is
much better to use a module. Here is one example, using
HTML::TokeParser :
################### begin code
use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser;
my @all_data;    # an array to hold the data
# Parse the HTML
my $parser = HTML::TokeParser->new("input.html")
    || die "Can't open input file input.html: $!";
# Search for a font tag and extract the data.
while ( defined( my $token = $parser->get_tag("font") ) ) {
    my $data = $parser->get_text;    # get the data
    $data =~ s/^\s+//;               # get rid of extra whitespace at
    $data =~ s/\s+$//;               # the beginning and end
    push @all_data, $data;           # save the data
}
print Dumper( \@all_data );
################### end code
This approach assumes that the data always comes after a font tag
(based on your example data). If this isn't the case, the code has to
change, but it is a lot easier to do if you use HTML::TokeParser than
if you do so using split.
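If the data were not reliably wrapped in font tags, one way to adapt the loop would be to key on the table cells themselves. The sketch below assumes that every th label is followed by the td holding its value, as in the sample row. It reads the HTML from a string ($html, a stand-in for the real file), and %record is a made-up name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Inline copy of the sample row (an assumption; normally a file name
# would be passed to new() instead).
my $html = <<'END_HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
END_HTML

# A reference to a string makes TokeParser parse the string itself
my $parser = HTML::TokeParser->new( \$html )
    or die "Can't parse the sample HTML";

my %record;    # label => value
my $label;
while ( my $token = $parser->get_tag( 'th', 'td' ) ) {
    my $tag = $token->[0];

    # Read all text up to the matching end tag, skipping nested tags
    my $text = $parser->get_trimmed_text("/$tag");

    if ( $tag eq 'th' ) {
        $label = $text;              # remember the label
    }
    elsif ( defined $label ) {
        $record{$label} = $text;     # pair the value with its label
        undef $label;
    }
}

print "$_ => $record{$_}\n" for sort keys %record;
```

Passing the end tag to get_trimmed_text means the font tags no longer matter, so the same loop would still work if they were removed from the pages.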
If you insist on using split, read "perldoc -f split".
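For comparison, here is a rough sketch of what the split idea might look like on the sample row: the text is split wherever something that looks like a tag occurs, and the empty pieces are thrown away. It works on this sample, but it also shows the fragility mentioned above, since a '>' inside an attribute value, an HTML comment, or a script block would confuse the pattern, and entities such as &amp; are left undecoded. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Inline copy of the sample row (an assumption; a real script would
# read the file contents into $html).
my $html = <<'END_HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
END_HTML

# Split on anything that looks like a tag; the pieces are the text
# found between tags
my @pieces = split /<[^>]*>/, $html;

my @data;
for my $piece (@pieces) {
    $piece =~ s/^\s+//;    # trim leading whitespace
    $piece =~ s/\s+$//;    # trim trailing whitespace
    push @data, $piece if length $piece;    # skip empty pieces
}

print "$_\n" for @data;
```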
Hope this helps,
--
Offer Kaye
| |
| Charles K. Clarkson 2005-03-29, 3:56 pm |
| Offer Kaye <mailto:offer.kaye@gmail.com> wrote:
: while (defined(my $token = $parser->get_tag("font"))) {
: my $data = $parser->get_text; #get the data
: $data =~ s/^\s+//; #get rid of extra whitespace the
: $data =~ s/\s+$//; # the beginning and end
: push @all_data,$data; # save the data
: }
I am a big fan of HTML::TokeParser. You can avoid the
regular expressions above by using the get_trimmed_text()
method.
while ( defined( my $token = $parser->get_tag('font') ) ) {
    push @all_data, $parser->get_trimmed_text();
}
Technically, we don't need $token, but it is often
useful to have it on hand.
while ( $parser->get_tag('font') ) {
    push @all_data, $parser->get_trimmed_text();
}
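As an illustration of why keeping $token can be handy: for a start tag, get_tag returns an array reference whose first element is the tag name and whose second is a hash of the tag's attributes. The sketch below uses that to keep only the text of font tags with size="-1". The size="+2" heading in the sample string is invented for the demonstration, and $html stands in for the real file.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample: one large heading plus two small-font cells
my $html = <<'END_HTML';
<font size="+2">Search results</font>
<table><tr>
<th><font size="-1">Need this too</font></th>
<td><font size="-1">More data I need to get out</font></td>
</tr></table>
END_HTML

my $parser = HTML::TokeParser->new( \$html )
    or die "Can't parse the sample HTML";

my @all_data;
while ( my $token = $parser->get_tag('font') ) {
    my ( $tag, $attr ) = @$token;    # tag name, hash ref of attributes

    # Skip font tags that are not the small ones used in the tables
    next unless defined $attr->{size} && $attr->{size} eq '-1';

    push @all_data, $parser->get_trimmed_text();
}

print "$_\n" for @all_data;
```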
HTH,
Charles K. Clarkson
--
Mobile Homes Specialist
254 968-8328
| |
| FreeFall 2005-03-30, 3:56 pm |
I am new to Perl too, and had a little try:
#!/usr/bin/perl
use warnings;
use strict;
my @data;
while (<>) {
    chomp;
    # keep only the lines that do not start with a tag
    push @data, $_ if !/^\</;
}
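To see what this filter keeps, the sketch below runs the same test over the sample row from the original post, held in a string instead of being read from a file (an assumption made so the example is self-contained). Only lines that do not begin with '<' survive, so text that shares a line with its tags is dropped.

```perl
use strict;
use warnings;

# Inline copy of the sample row from the original post
my $html = <<'END_HTML';
<tr>
<th align="left" width="10%"><font size="-1">Data to be extracted </font></th>
<td width="30%"><font size="-1">
DATA DATA DATA
</font></td>
<th align="left" width="10%"><font size="-1">Need this too</font></th>
<td colspan="3" valign="top"><font size="-1">More data I need to get out</font></td>
</tr>
END_HTML

my @data;
for my $line ( split /\n/, $html ) {
    # Same test as above: keep lines that do not start with '<'
    push @data, $line if $line !~ /^</;
}

print "$_\n" for @data;    # prints only: DATA DATA DATA
```

On this sample the other three pieces of text sit on the same line as their tags, which is why a tag-aware parser picks them up and a line filter does not.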
--
Whatever you do will be insignificant,but
the important is you do it!