Code Comments

Extracting nested tables from HTML
Hi!

I have several very large HTML files from which I'd like to extract only
tables nested at the deepest level.  I thought this would be quite easy
by extracting  something like (<table.*?</table> ) where I'd alter the
'.*?' to test for and reject any new occurrences of a starting table
tag, but I can't seem to get it.  Any pointers?

I want to deal with the file at a text level until the tables are
extracted, after which I plan to use HTML::TableContentParser to extract
the needed content.

Thanks for your help.

Terry.

Terry
12-31-04 01:56 PM


Re: Extracting nested tables from HTML
"Terry" <just@say.no> wrote in message
news:og8at0tfs8veatosp6r9i1u1dmb09jqcbb@4ax.com...
> Hi!
>
> I have several very large HTML files from which I'd like to extract only
> tables nested at the deepest level.  I thought this would be quite easy
> by extracting  something like (<table.*?</table> ) where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it.  Any pointers?
>
> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.

Search the archives of this newsgroup for "HTML" and "Regular
Expression" to see why this is simply a bad idea.  Regular expressions
are not sufficient to parse HTML in general.  That is why CPAN contains
so many HTML parsers.  Visit http://search.cpan.org, search for HTML,
and try the one that best suits your needs.
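
As a rough illustration of the parser route, here is a minimal sketch using HTML::Parser from CPAN. It keeps a stack of open tables and saves the source text of every table that closes without having contained another table. The sample markup, the handler names (on_start, on_end, on_other) and the "no nested table" reading of "deepest level" are assumptions made for the example, not something stated in the thread.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical sample input; the real files are not shown in the thread.
my $html = <<'HTML';
<table id="outer"><tr><td>
  <table id="inner1"><tr><td>A</td></tr></table>
  <table id="inner2"><tr><td>B</td></tr></table>
</td></tr></table>
HTML

my @stack;        # one entry per currently open <table>
my @innermost;    # source text of tables that contained no other table

sub on_start {
    my ($tag, $text) = @_;
    if ($tag eq 'table') {
        # The enclosing table, if any, is no longer a candidate.
        $stack[-1]{has_child} = 1 if @stack;
        push @stack, { text => '', has_child => 0 };
    }
    # Only the innermost open table collects text; outer ones are
    # discarded anyway once they are known to have a child.
    $stack[-1]{text} .= $text if @stack;
}

sub on_end {
    my ($tag, $text) = @_;
    return unless @stack;    # ignore end tags outside any table
    $stack[-1]{text} .= $text;
    if ($tag eq 'table') {
        my $done = pop @stack;
        push @innermost, $done->{text} unless $done->{has_child};
    }
}

sub on_other {
    my ($text) = @_;
    $stack[-1]{text} .= $text if @stack;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname, text' ],
    end_h       => [ \&on_end,   'tagname, text' ],
    default_h   => [ \&on_other, 'text' ],
);
$p->parse($html);
$p->eof;

print "$_\n---\n" for @innermost;
```

Because the parser hands over the original source text of each event, the saved strings can still be passed to a table-content module afterwards.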

Paul Lalli


Paul Lalli
12-31-04 08:56 PM


Re: Extracting nested tables from HTML
Terry <just@say.no> wrote:


> I have several very large HTML files


How large might "very large" be when you say it?

In my experience, only very poorly designed websites serve what I
would call "very large" HTML files...


> from which I'd like to extract only
> tables nested at the deepest level.


Is this deepest level an arbitrary depth, or is it always the
same depth for this type of HTML file?

The answer to this would have a large impact on the approaches
available.


> I thought this would be quite easy
> by extracting  something like (<table.*?</table> )


I guess you've never taken a Formal Methods class then.  :-)


> where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it.  Any pointers?


Yes, but you have surely already seen them since they are
Questions that are Asked Frequently:

    perldoc -q nest

        How do I find matching/nesting anything?

    perldoc -q HTML

        How do I remove HTML from a string?
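
For the narrow case in the question (tables that contain no other table), the "reject a new starting table tag" idea can be written with a negative lookahead. The following is only a sketch on made-up sample markup. It treats the file as plain text, so it will be fooled by table tags inside comments or scripts, by a `>` inside an attribute value, and by unbalanced markup. It does not handle arbitrary nesting.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real files are not shown in the thread.
my $html = <<'HTML';
<table id="outer"><tr><td>
  <table id="inner1"><tr><td>A</td></tr></table>
  <table id="inner2"><tr><td>B</td></tr></table>
</td></tr></table>
HTML

# Match an opening <table ...> tag, then consume one character at a
# time only while no other <table ...> opening tag starts at that
# position, and stop at the first </table>.
my @innermost = $html =~ m{
    ( <table\b [^>]* >
      (?: (?! <table\b ) . )*?
      </table\s*> )
}gisx;

print "$_\n---\n" for @innermost;
```

An attempt that starts at the outer tag fails as soon as the scan reaches the next opening tag, so only tables without a nested table are returned.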


> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.


I would use HTML::TableExtract.

It gives you the depth via the coords() method.
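
A minimal sketch of that suggestion, assuming HTML::TableExtract is installed from CPAN and that "deepest level" means the greatest depth reported anywhere in the file. The sample markup and variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical sample input; the real files are not shown in the thread.
my $html = <<'HTML';
<table id="outer"><tr><td>
  <table id="inner1"><tr><td>A</td></tr></table>
  <table id="inner2"><tr><td>B</td></tr></table>
</td></tr></table>
HTML

# No depth, count or header constraints: every table is collected.
my $te = HTML::TableExtract->new;
$te->parse($html);
my @tables = $te->tables;

# coords() returns (depth, count); find the greatest depth seen.
my $max_depth = 0;
for my $ts (@tables) {
    my ($depth) = $ts->coords;
    $max_depth = $depth if $depth > $max_depth;
}

# Keep only the tables sitting at that depth.
my @deepest = grep { ($_->coords)[0] == $max_depth } @tables;

for my $ts (@deepest) {
    my ($depth, $count) = $ts->coords;
    print "Table at depth $depth, count $count\n";
    for my $row ($ts->rows) {
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Note that "greatest depth in the file" is not the same as "contains no nested table": a shallow table with no children would be skipped here. If the second meaning is wanted, the selection step needs a different rule.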


--
Tad McClellan                          SGML consulting
tadmc@augustmail.com                   Perl programming
Fort Worth, Texas

Tad McClellan
12-31-04 08:56 PM


PERL Miscellaneous archive