Home > Archive > PERL Miscellaneous > December 2004 > Extracting nested tables from HTML










Subject: Extracting nested tables from HTML
Terry

2004-12-31, 8:56 am

Hi!

I have several very large HTML files from which I'd like to extract only
tables nested at the deepest level. I thought this would be quite easy
by extracting something like (<table.*?</table> ) where I'd alter the
'.*?' to test for and reject any new occurrences of a starting table
tag, but I can't seem to get it. Any pointers?

I want to deal with the file at a text level until the tables are
extracted, after which I plan to use HTML::TableContentParser to extract
the needed content.

Thanks for your help.

Terry.
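
For reference, the idea described above (rejecting any new opening table tag inside the match) can be written with a negative lookahead. This is only a minimal sketch: the sample `$html` is made up, and the pattern finds tables that contain no other table, which is not necessarily the same thing as the deepest nesting level. It is also fooled by comments, scripts, or attribute values that happen to contain `<table`.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real files would be read from disk.
my $html = <<'HTML';
<table id="outer"><tr><td>
  <table id="inner1"><tr><td>A</td></tr></table>
  <table id="inner2"><tr><td>B</td></tr></table>
</td></tr></table>
HTML

# Match an opening table tag, then a body in which no further opening
# table tag may begin, up to the nearest closing table tag.
my @innermost = $html =~ m{
    ( <table\b [^>]* >            # opening tag
      (?: (?! <table\b ) . )*?    # body: refuse any new opening table tag
      </table\s*> )               # closing tag
}gxsi;

print scalar(@innermost), " innermost table(s)\n";
print "$_\n\n" for @innermost;
```

The `/s` flag lets `.` cross newlines, and the lookahead is checked before every single character of the body, so an outer table can never match as a whole.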
Paul Lalli

2004-12-31, 3:56 pm

"Terry" <just@say.no> wrote in message
news:og8at0tfs8veatosp6r9i1u1dmb09jqcbb@4ax.com...
> Hi!
>
> I have several very large HTML files from which I'd like to extract only
> tables nested at the deepest level. I thought this would be quite easy
> by extracting something like (<table.*?</table> ) where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it. Any pointers?
>
> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.


Search the archives of this newsgroup for "HTML" and "Regular
Expression" to see why this is simply a bad idea. Regular expressions
are not sufficient to parse HTML. That is why CPAN contains so many
HTML parsers. Visit http://search.cpan.org and search for HTML, then
try one that looks like it best suits your needs.

Paul Lalli

Tad McClellan

2004-12-31, 3:56 pm

Terry <just@say.no> wrote:


> I have several very large HTML files



How large might "very large" be when you say it?

In my experience, only very poorly designed websites serve what I
would call "very large" HTML files...


> from which I'd like to extract only
> tables nested at the deepest level.



Is this deepest level an arbitrary depth, or is it always the
same depth for this type of HTML file?

The answer to this would have a large impact on the approaches
available.


> I thought this would be quite easy
> by extracting something like (<table.*?</table> )



I guess you've never taken a Formal Methods class then. :-)


> where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it. Any pointers?



Yes, but you have surely already seen them since they are
Questions that are Asked Frequently:

perldoc -q nest

How do I find matching/nesting anything?

perldoc -q HTML

How do I remove HTML from a string?
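
As a rough illustration of why nesting calls for counting rather than a single pattern, here is a hand-rolled sketch that walks the table tags in order and tracks depth with a stack. It is not taken from the FAQ answers above; the sample `$html` and all variable names are made up, and it assumes well-formed markup with no table tags hidden in comments or scripts.

```perl
use strict;
use warnings;

# Hypothetical sample input with tables nested to different depths.
my $html = <<'HTML';
<table id="a"><tr><td>
  <table id="b"><tr><td>
    <table id="c"><tr><td>deep</td></tr></table>
  </td></tr></table>
  <table id="d"><tr><td>shallow</td></tr></table>
</td></tr></table>
HTML

my @open;     # stack of start offsets for the tables currently open
my @found;    # one [depth, start, end] triple per completed table

# Visit every opening or closing table tag in document order.
while ($html =~ m{<(/?)table\b[^>]*>}gi) {
    if ($1 eq '') {
        push @open, $-[0];             # opening tag: remember its offset
    }
    elsif (@open) {
        my $depth = $#open;            # 0 means an outermost table
        my $start = pop @open;
        push @found, [ $depth, $start, $+[0] ];
    }
}
die "no tables found\n" unless @found;

# Keep only the tables at the greatest depth seen.
my ($max) = sort { $b <=> $a } map { $_->[0] } @found;
my @deepest = map  { substr($html, $_->[1], $_->[2] - $_->[1]) }
              grep { $_->[0] == $max } @found;

print "max depth: $max\n";
print "$_\n" for @deepest;
```

Unlike a single regex, this distinguishes a table that merely has no children (such as `d` in the sample) from one that sits at the deepest level (`c`).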


> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.



I would use HTML::TableExtract.

It gives you the depth via the coords() method.
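
A minimal sketch of that suggestion, assuming HTML::TableExtract is installed from CPAN. The sample `$html` is invented for illustration; the calls used (`new`, `parse`, `tables`, `coords`, `rows`) are the module's documented interface, and depth is counted from 0 for top-level tables.

```perl
use strict;
use warnings;
use HTML::TableExtract;   # CPAN module, not part of core Perl

# Hypothetical sample input with tables nested to different depths.
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>
    <table><tr><td>deep 1</td><td>deep 2</td></tr></table>
  </td></tr></table>
  <table><tr><td>shallow</td></tr></table>
</td></tr></table>
HTML

# With no constraints given, every table in the document is matched.
my $te = HTML::TableExtract->new;
$te->parse($html);

# coords() returns (depth, count); find the greatest depth present.
my $max_depth = 0;
for my $ts ($te->tables) {
    my ($depth, $count) = $ts->coords;
    $max_depth = $depth if $depth > $max_depth;
}

# Print the rows of every table that sits at that depth.
for my $ts (grep { ($_->coords)[0] == $max_depth } $te->tables) {
    my ($depth, $count) = $ts->coords;
    print "table at depth $depth, count $count\n";
    for my $row ($ts->rows) {
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

If the depth is the same in every file, passing `depth => N` to `new` would skip the first loop; the two-pass version above is for the case where the depth is not known in advance.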


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas