
Subject: extracting text content from web page
kjhjhjhjadsasda@urbanhabit.com

2005-09-28, 6:58 pm

I'm trying to write a Perl script that extracts text content from a
web page in a meaningful way. I've tried modules and regular
expressions but haven't found a good way yet.

To avoid "crappy" text slipping through, is there a way of extracting
only sentences? For example:

- clean the tags out of the HTML
- extract sentences by counting the number of words between
  punctuation marks, or something similar.
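
The second step could be sketched roughly like this in plain Perl. It assumes the tags are already gone and the text arrives as a list of chunks; the function name `sentences` and the five-word threshold are made up for illustration, and abbreviations such as "e.g." will confuse the naive split.

```perl
use strict;
use warnings;

# Keep only pieces that look like sentences: they must end in '.', '!'
# or '?' and contain at least $min_words whitespace-separated words.
# Both rules are guesses, not a real sentence detector.
sub sentences {
    my ($min_words, @chunks) = @_;
    my @keep;
    for my $chunk (@chunks) {
        # Split after sentence-ending punctuation followed by whitespace.
        for my $piece (split /(?<=[.!?])\s+/, $chunk) {
            next unless $piece =~ /[.!?]$/;
            my @words = split ' ', $piece;
            push @keep, $piece if @words >= $min_words;
        }
    }
    return @keep;
}

# Hypothetical chunks, as they might come out of a tag stripper.
my @chunks = (
    "Home",
    "News | Contact",
    "Perl is a language that many people use for text processing. It is fun.",
);
print "$_\n" for sentences(5, @chunks);
```

Menu entries and other short fragments fall below the word threshold or lack closing punctuation, so only the long sentence is printed here.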

Any other ideas on how to nicely pick out content text from a webpage?

Thanks
M

Ron Savage

2005-09-28, 6:58 pm

On Thu, 29 Sep 2005 06:06:03 +1000, kjhjhjhjadsasda@urbanhabit.com wrote:

Hi M

HTML::TokeParser is the one you want. The docs are excellent.

The author has also written a book - Perl & LWP - which I recommend.

Note: Download the list of misprints, though!
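
A minimal sketch of what that can look like, assuming the page is already in a string (fetching it is left out). The function name `text_chunks` and the sample markup are invented for the example; `get_token`, `get_tag` and `decode_entities` are the documented HTML::TokeParser and HTML::Entities calls.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Return the text chunks of an HTML string, skipping <script> and <style>.
sub text_chunks {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @chunks;
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] =~ /^(?:script|style)$/) {
            # Jump to the matching end tag so code and CSS are not collected.
            $p->get_tag("/$token->[1]");
        }
        elsif ($type eq 'T') {
            my $text = decode_entities($token->[1]);
            $text =~ s/\s+/ /g;     # collapse runs of whitespace
            $text =~ s/^ | $//g;    # trim one leading/trailing space
            push @chunks, $text if length $text;
        }
    }
    return @chunks;
}

# Invented sample page.
my $html = '<html><head><style>p{color:red}</style></head>'
         . '<body><p>Hello <b>world</b>.</p><script>var x=1;</script></body></html>';
print "$_\n" for text_chunks($html);
```

Each chunk is the text between two tags, so a sentence that contains inline markup arrives in several pieces.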


Dr.Ruud

2005-09-28, 6:58 pm

kjhjhjhjadsasda@urbanhabit.com wrote:

> Any other ideas on how to nicely pick out content text from a webpage?


HTML::Parser
http://www.gellyfish.com/htexamples/

--
Affijn, Ruud

"Gewoon is een tijger."


kjhjhjhjadsasda@urbanhabit.com

2005-09-30, 3:56 am

Hi Ron

TokeParser is great. However, I still get a lot of "menu text", alt
text, etc. Is there a way to have it accept only "sentence-length" text?

What do you mean by downloading the misprints?
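
One possible way to cut this down, sketched under the assumption that the real content sits in `<p>` elements (many pages break that assumption). By default `get_text` replaces `<img>` and `<applet>` tags with their alt text through the parser's `textify` hash, so the sketch empties it. The function name `paragraph_texts` and the sample markup are invented.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the text of each <p> element, ignoring everything outside
# paragraphs (menus, link lists) and dropping image alt text.
sub paragraph_texts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    $p->{textify} = {};    # default is {img => 'alt', applet => 'alt'}
    my @paras;
    while ($p->get_tag('p')) {
        # Text up to the closing </p>, with whitespace collapsed.
        my $text = $p->get_trimmed_text('/p');
        push @paras, $text if length $text;
    }
    return @paras;
}

# Invented sample: a menu followed by one paragraph with an image.
my $html = '<ul><li><a href="/">Home</a></li></ul>'
         . '<p>Read <img src="x.png" alt="logo"> this.</p>';
print "$_\n" for paragraph_texts($html);
```

A word-count check on each paragraph could then drop whatever short fragments remain.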

Thanks!
M

Ron Savage wrote:

> On Thu, 29 Sep 2005 06:06:03 +1000, kjhjhjhjadsasda@urbanhabit.com wrote:
>
> Hi M
>
> HTML::TokeParser is the one you want. The docs are excellent.
>
> The author has also written a book - Perl & LWP - which I recommend.
>
> Note: Download the list of misprints, though!


Sherm Pendley

2005-09-30, 3:56 am

kjhjhjhjadsasda@urbanhabit.com writes:

Note - upside-down quoting fixed. Please don't do that.

> Ron Savage wrote:
>
>
> What do you mean by downloading the misprints?


Errata. Typos and other errors in a book are often listed, along with the
corrections of course, on a publisher's web site.

For this particular book, the publisher is O'Reilly, and the errata are
listed here:

<http://www.oreilly.com/catalog/perllwp/>

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org