extracting text content from web page
| kjhjhjhjadsasda@urbanhabit.com 2005-09-28, 6:58 pm |
| I'm trying to write a Perl script that extracts text content from a
web page in a meaningful way. I've tried modules and regular expressions
but haven't found a good way yet.
To avoid "crappy" text slipping through, is there a way of extracting
only sentences? ex:
-clean the html from tags
-extract sentences by counting the number of words between
punctuation marks, or something similar.
Any other ideas on how to nicely pick out content text from a webpage?
Thanks
M
| |
| Ron Savage 2005-09-28, 6:58 pm |
| On Thu, 29 Sep 2005 06:06:03 +1000, kjhjhjhjadsasda@urbanhabit.com wrote:
Hi M
HTML::TokeParser is the one you want. The docs are excellent.
The author has also written a book - Perl & LWP - which I recommend.
Note: Download the list of misprints, though!
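
A minimal sketch of the HTML::TokeParser approach, not code from the thread: it walks the token stream with get_token and collects the text tokens, skipping anything inside script and style elements. The function name text_chunks and the choice of which elements to skip are assumptions made for illustration. Since it reads raw text tokens instead of calling get_text, image alt text should not end up in the output.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the visible text chunks of an HTML string.
sub text_chunks {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser for HTML string";
    $p->unbroken_text(1);    # report each run of text as a single token

    my %skip = map { $_ => 1 } qw(script style);    # assumed skip list
    my $skipping = 0;
    my @chunks;

    while (my $token = $p->get_token) {
        my ($type, $value) = @$token;    # for S/E tokens $value is the tag name
        if ($type eq 'S' && $skip{$value}) {
            $skipping++;
        }
        elsif ($type eq 'E' && $skip{$value}) {
            $skipping-- if $skipping;
        }
        elsif ($type eq 'T' && !$skipping) {
            # $token->[2] is true for literal data that must not be decoded
            my $text = $token->[2] ? $value : decode_entities($value);
            $text =~ s/\s+/ /g;      # collapse runs of whitespace
            $text =~ s/^ | $//g;     # trim leading and trailing space
            push @chunks, $text if length $text;
        }
    }
    return @chunks;
}
```

Calling text_chunks($html) on a page fetched with LWP gives a list of strings, one per run of text between tags, which can then be filtered further.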
| |
| Dr.Ruud 2005-09-28, 6:58 pm |
| kjhjhjhjadsasda@urbanhabit.com wrote:
> Any other ideas on how to nicely pick out content text from a webpage?
HTML::Parser
http://www.gellyfish.com/htexamples/
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| kjhjhjhjadsasda@urbanhabit.com 2005-09-30, 3:56 am |
| Hi Ron
TokeParser is great. However, I still get a lot of "menu text", alt
attribute text, etc. Is there a way to have it accept only "sentence length" text?
What do you mean by downloading misprints?
Thanks!
M
Ron Savage wrote:
> On Thu, 29 Sep 2005 06:06:03 +1000, kjhjhjhjadsasda@urbanhabit.com wrote:
>
> Hi M
>
> HTML::TokeParser is the one you want. The docs are excellent.
>
> The author has also written a book - Perl & LWP - which I recommend.
>
> Note: Download the list of misprints, though!
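
One way the "sentence length" idea could be sketched in plain Perl, as a rough heuristic and not a feature of HTML::TokeParser: split each text chunk at sentence-ending punctuation and keep only the pieces that end in ".", "!" or "?" and contain a minimum number of words. The function name sentence_like and the word threshold are assumptions, and abbreviations such as "e.g." will be split incorrectly.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only pieces of text that look like sentences.
# $min_words is a guessed threshold; tune it against real pages.
sub sentence_like {
    my ($min_words, @chunks) = @_;
    my @keep;
    for my $chunk (@chunks) {
        # Split after ., ! or ? when followed by whitespace.
        for my $candidate (split /(?<=[.!?])\s+/, $chunk) {
            # Must end in sentence punctuation (optionally closed by a quote or bracket).
            next unless $candidate =~ /[.!?]["')\]]*$/;
            my @words = split ' ', $candidate;
            push @keep, $candidate if @words >= $min_words;
        }
    }
    return @keep;
}

my @kept = sentence_like(
    5,
    'Home | About | Contact',
    'This is a real sentence with enough words. Short one.',
    'Click here',
);
print "$_\n" for @kept;
```

Menu text such as "Home | About | Contact" is dropped because it has no closing punctuation, and "Short one." is dropped because it has too few words.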
| |
| Sherm Pendley 2005-09-30, 3:56 am |
| kjhjhjhjadsasda@urbanhabit.com writes:
Note - upside-down quoting fixed. Please don't do that.
> Ron Savage wrote:
>
>
> What do you mean by download misprints?
Errata. Typos and other errors in a book are often listed, along with the
corrections of course, on a publisher's web site.
For this particular book, the publisher is O'Reilly, and the errata are
listed here:
<http://www.oreilly.com/catalog/perllwp/>
sherm--
--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org