Home > Archive > PERL Beginners > November 2006 > TableContentParser output question
TableContentParser output question
| Dennis Bourn 2006-11-17, 9:57 pm |
I'm trying to hack together a Perl script to screen scrape some data from
a table on a webpage and enter that data into a MySQL database.
This would be my first attempt using Perl and HTML::TableContentParser.
The following script was created using bits and pieces I've found in
various Perl examples on the web:
-------------------------------
#!/usr/bin/perl
#use strict;
use lib '/opt/local/lib/perl5/vendor_perl/5.8.6/';
use HTML::TableContentParser;
my $url = 'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
$p = HTML::TableContentParser->new();
my $tables = $p->parse($content);
for $t (@$tables) {
    for $r (@{$t->{rows}}) {
        print "Row: ";
        for $c (@{$r->{cells}}) {
            print "[$c->{data}] ";
        }
        print "\n";
    }
}
----------------------------------
My question is: how do I refer to a specific entry, such as table 1, row
2, table data 2, without the loop?
If you look at the web page I'm scraping from, you can see it's
data on an oil well. I am only interested in the first 4 tables. I
want to assign each entry to a variable (my $serial =) so I can easily get
them into a database.
Does anyone have any insight that might help me out?
Dennis Bourn
GeoTech
CLK Energy
| Rob Dixon 2006-11-17, 9:57 pm |
Dennis Bourn wrote:
>
> Im trying to hack together a perl script to screen scrape some data from
> a table on a webpage and enter that data into a MySQL database.
> This would be my first attempt using perl and HTML::TableContentParser.
>
> The following script was created using bits and pieces ive found on
> various perl examples on the web;
> -------------------------------
> #!/usr/bin/perl
> #use strict;
> use lib '/opt/local/lib/perl5/vendor_perl/5.8.6/';
> use HTML::TableContentParser;
> my $url = 'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';
>
> use LWP::Simple;
> my $content = get $url;
> die "Couldn't get $url" unless defined $content;
> $p = HTML::TableContentParser->new();
> my $tables = $p->parse($content);
> for $t (@$tables) {
> for $r (@{$t->{rows}}) {
> print "Row: ";
> for $c (@{$r->{cells}}) {
> print "[$c->{data}] ";
> }
> print "\n";
> }
> }
> ----------------------------------
> My question is how do I refer to a specific entry,.. such as table 1 row
> 2 tabledata 2 without the loop?
>
> If you were to look at the web page im scraping from you can see its
> data on an oil well,.. I am only interested in the first 4 tables. I
> want to set variables to each entry (my $serial =) so i can eaisly get
> them into a database.
>
> Does anyone have any insight that might help me out?
Hi Dennis

Well, I was in the process of writing code to use HTML::TableContentParser just
to prove that you shouldn't use it when the module shot itself in the foot. If
you write

  my $tables = $p->parse($content);
  use Data::Dumper;
  print Dumper $tables;

then you will see that it has lost the data for the third table altogether (the
latitude and longitude). Checking the HTML reveals that this is because that
table has a missing <tr> tag, which confuses the parser. Much better to use
HTML::TableExtract which, although not perfect, has a better pedigree and is
fine for this purpose. It's also much better at handling incorrect HTML.
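
For reference, the structure that Dumper prints can be indexed directly, which is what the "without the loop" question comes down to. The sketch below uses a hand-built stand-in for $tables, shaped like the nesting the loops in the original script walk (tables, then rows, then cells, then data). The cell values are made up for illustration and are not taken from the real page.

```perl
use strict;
use warnings;

# Hand-built stand-in for what $p->parse($content) returns: an array
# reference of tables, each holding rows, each row holding cells.
# The values are invented for this example, not read from the real page.
my $tables = [
    {
        rows => [
            { cells => [ { data => 'Serial' }, { data => 'Well Name' } ] },
            { cells => [ { data => '12345' },  { data => 'EXAMPLE 001' } ] },
        ],
    },
];

# Table 1, row 2, cell 2. Perl indices are zero-based, so each is one less.
my $cell = $tables->[0]{rows}[1]{cells}[1]{data};
print "$cell\n";

# A whole row pulled out as a plain list of strings.
my ($serial, $name) = map { $_->{data} } @{ $tables->[0]{rows}[1]{cells} };
print "$serial / $name\n";
```

This only shows the indexing. It does nothing about the table that the parser drops.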
The program below parses the HTML, then dumps the data with the tables_dump
method. The output from this alone may be adequate for you. It then goes on to
push all the headers and data from the first four tables onto two arrays and
then print those formatted in parallel. It seems to do what you want.
Some comments on your own code though. *Never* give up and comment out 'use
strict' - it exists to help you by saving you from yourself and removing it is
much like disabling your smoke alarm so that it doesn't make a noise when you
burn the toast. Secondly, your 'use lib' statement looks suspicious. It's
occasionally necessary to point to a separate directory for a development
version of a library, but this is a public release which should have been
installed somewhere in one of the include paths. Again, fix the problem rather
than making it work anyway.
I hope this helps.
Rob
use strict;
use warnings;

use LWP::Simple;
use HTML::TableExtract;
use List::Util qw/max/;

my $url = 'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';
my $content = get $url or die "Couldn't get $url";

# Parse every table on the page and dump them all for inspection
my $htex = HTML::TableExtract->new;
$htex->parse($content);
print $htex->tables_dump(1);
print "\n\n";

# Collect the header row and the data row from the first four tables
my @tables = $htex->tables;
my (@header, @data);
foreach my $table (@tables[0..3]) {
    push @header, $table->row(0);
    push @data, $table->row(1);
}

# Print "header = value" pairs, with the headers padded to a common width
my $len = max map length, @header;
my $i = 0;
foreach my $head (@header) {
    printf "%-*s = %s\n", $len, $head, $data[$i++];
}
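
If only a handful of values are wanted, a single cell can also be fetched directly instead of collecting whole rows. This is a minimal sketch using the table(depth, count) method of HTML::TableExtract and the cell(row, column) method of its table objects. The HTML here is made up to stand in for the fetched page, and the variable names $serial and $latitude are only illustrative.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up HTML standing in for the fetched page (an assumption for this sketch).
my $html = <<'HTML';
<table>
<tr><td>Serial</td><td>Well Name</td></tr>
<tr><td>12345</td><td>EXAMPLE 001</td></tr>
</table>
<table>
<tr><td>Latitude</td><td>Longitude</td></tr>
<tr><td>30.0</td><td>-90.0</td></tr>
</table>
HTML

my $htex = HTML::TableExtract->new;
$htex->parse($html);
$htex->eof;    # tell the underlying HTML::Parser that the input is complete

# Depth 0 means top-level (not nested) tables; the count is zero-based.
my $first  = $htex->table(0, 0);
my $second = $htex->table(0, 1);

# cell(row, column), both zero-based: row 1 is the data row in this layout.
my $serial   = $first->cell(1, 0);
my $latitude = $second->cell(1, 0);
print "serial=$serial latitude=$latitude\n";
```

On the real page the depth and count of each table would need checking against the tables_dump output first, since nested layout tables change both numbers.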