Code Comments
Hi!

I have several very large HTML files from which I'd like to extract only tables nested at the deepest level. I thought this would be quite easy by extracting something like (<table.*?</table>), where I'd alter the '.*?' to test for and reject any new occurrences of a starting table tag, but I can't seem to get it. Any pointers?

I want to deal with the file at a text level until the tables are extracted, after which I plan to use HTML::TableContentParser to extract the needed content.

Thanks for your help.

Terry.
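The idea described above, altering the '.*?' so that it rejects any new opening table tag, can be sketched in Perl with a negative lookahead. This is only an illustration under assumptions: the sample HTML and the variable names are made up here, and the pattern finds tables that contain no other table, which is not necessarily the deepest nesting level in the file. Table tags inside comments, scripts, or attribute values would also fool it.

```perl
use strict;
use warnings;

# Hypothetical sample input; the original post does not include any HTML.
my $html = <<'HTML';
<table id="outer"><tr><td>
  <table id="inner1"><tr><td>a</td></tr></table>
  <table id="inner2"><tr><td>b</td></tr></table>
</td></tr></table>
HTML

# Match an opening <table ...> tag, then consume one character at a time
# only while the next position does not start another <table ...> tag,
# then the closing </table>. /s lets "." cross newlines, /i ignores case,
# /x allows the whitespace used here for layout.
my @innermost = $html =~ m{
    ( <table\b [^>]* >
      (?: (?! <table\b ) . )*?
      </table\s*> )
}gsix;

print scalar(@innermost), " innermost table(s) found\n";
print "$_\n" for @innermost;
```

An attempt that starts at the outer table fails as soon as it meets a nested opening tag, so only tables without nested tables are captured.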
"Terry" <just@say.no> wrote in message news:og8at0tfs8veatosp6r9i1u1dmb09jqcbb@4ax.com...
> Hi!
>
> I have several very large HTML files from which I'd like to extract only
> tables nested at the deepest level. I thought this would be quite easy
> by extracting something like (<table.*?</table> ) where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it. Any pointers?
>
> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.

Search the archives of this newsgroup for "HTML" and "Regular Expression" to see why this is simply a bad idea. Regular expressions are not sufficient to parse HTML. That is why CPAN contains so many HTML parsers. Visit http://search.cpan.org and search for HTML, then try the one that looks like it best suits your needs.

Paul Lalli
Terry <just@say.no> wrote:

> I have several very large HTML files

How large might "very large" be when you say it?

In my experience, only very poorly designed websites serve what I would call "very large" HTML files...

> from which I'd like to extract only
> tables nested at the deepest level.

Is this deepest level an arbitrary depth, or is it always the same depth for this type of HTML file?

The answer to this would have a large impact on the approaches available.

> I thought this would be quite easy
> by extracting something like (<table.*?</table> )

I guess you've never taken a Formal Methods class then. :-)

> where I'd alter the
> '.*?' to test for and reject any new occurrences of a starting table
> tag, but I can't seem to get it. Any pointers?

Yes, but you have surely already seen them since they are Questions that are Asked Frequently:

    perldoc -q nest
        How do I find matching/nesting anything?

    perldoc -q HTML
        How do I remove HTML from a string?

> I want to deal with the file at a text level until the tables are
> extracted, after which I plan to use HTML::TableContentParser to extract
> the needed content.

I would use HTML::TableExtract. It gives you the depth via the coords() method.

-- 
Tad McClellan                          SGML consulting
tadmc@augustmail.com                   Perl programming
Fort Worth, Texas
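As a rough sketch of the HTML::TableExtract suggestion, the snippet below parses a document, reads each table's depth from coords(), and prints only the tables at the greatest depth found. The sample HTML and the variable names are assumptions made for illustration, and the module has to be installed from CPAN first.

```perl
use strict;
use warnings;
use List::Util qw(max);
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical sample input; the thread itself contains no HTML.
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>a</td><td>b</td></tr></table>
  <table><tr><td>c</td><td>d</td></tr></table>
</td></tr></table>
HTML

# With no depth, count, or headers constraints, every table is matched.
my $te = HTML::TableExtract->new;
$te->parse($html);

my @tables = $te->tables;
die "no tables found\n" unless @tables;

# coords() returns (depth, count) for a table; depth counts from 0.
my $deepest = max map { ($_->coords)[0] } @tables;
my @deepest_tables = grep { ($_->coords)[0] == $deepest } @tables;

for my $ts (@deepest_tables) {
    printf "Table at depth %d, count %d\n", $ts->coords;
    for my $row ($ts->rows) {
        # Cells can be undef when a row is shorter than the others.
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

If the interesting tables always sit at the same known depth, passing depth => N to new() would avoid the second pass over the table list.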