Home > Archive > PERL Beginners > August 2007 > parsing HTML content










Author parsing HTML content
Ladder49

2007-08-30, 10:20 pm

Is there a way to dump the HTML code for a web page? I need to write
a script which will collect and summarize content from intranet web
pages. By dump, I mean to read it the same way you would read a file
and parse its contents. Thanks.

Jeff Pang

2007-08-30, 10:20 pm

2007/8/30, ladder49 <richard.ballmann@montgomerycountymd.gov>:
> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.
>


You can use LWP to do this, like:

perl -MLWP::Simple -e '$c=get("http://www.yahoo.cn/");print $c'

See also `perldoc lwpcook`.
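
The same idea as a small script, in case the one-liner grows. This is only a sketch: the sub name fetch_page and the script name in the usage comment are made up for illustration, and the only library call it relies on is get() from LWP::Simple, which returns undef when the request fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch a page and return its HTML as one string.
# LWP::Simple::get returns undef on any failure, so check for that.
# (fetch_page is a hypothetical helper name, not part of LWP.)
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Usage (hypothetical script name): perl fetch_page.pl http://www.yahoo.cn/
if (@ARGV) {
    print fetch_page($ARGV[0]);
}
```

Once the page is in a scalar you can split it into lines or run regexes over it much as you would with the contents of a file. If you need status codes or headers, LWP::UserAgent gives more control than LWP::Simple.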

--
Jeff Pang - rwwebs@gmail.com
http://www.readwriteweb.com/

Daniel Kasak

2007-08-30, 10:20 pm

On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:

> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.


I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and
extract stuff.
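
A rough sketch of what the parsing half can look like. The sample markup, the sub name summarize_page and the choice of what to extract (title and link targets) are all assumptions for illustration, not something from this thread. In a real script $html would come from LWP::Simple's get($url) instead of the inline sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup standing in for a fetched intranet page (hypothetical content).
my $html = <<'HTML';
<html><head><title>Team Status</title></head>
<body>
<h1>Weekly Summary</h1>
<p>See <a href="/reports/q3.html">Q3 report</a> and
<a href="/reports/q4.html">Q4 report</a>.</p>
</body></html>
HTML

# Parse the HTML and pull out the page title and every link target.
# (summarize_page is a hypothetical helper name.)
sub summarize_page {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # look_down in scalar context returns the first matching element.
    my $title_el = $tree->look_down(_tag => 'title');
    my $title    = $title_el ? $title_el->as_text : '';

    # In list context look_down returns all matches; keep only <a> with href.
    my @links = grep { defined }
                map  { $_->attr('href') }
                $tree->look_down(_tag => 'a');

    $tree->delete;    # release the parse tree
    return { title => $title, links => \@links };
}

my $summary = summarize_page($html);
print "Title: $summary->{title}\n";
print "Link:  $_\n" for @{ $summary->{links} };
```

look_down accepts other attribute/value pairs too, so the same pattern can be narrowed to a particular table or div once you know how the intranet pages are laid out.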

--
Daniel Kasak
IT Developer
NUS Consulting Group
Level 5, 77 Pacific Highway
North Sydney, NSW, Australia 2060
T: (+61) 2 9922-7676 / F: (+61) 2 9922 7989
email: dkasak@nusconsulting.com.au
website: http://www.nusconsulting.com.au

Alan_C

2007-08-31, 4:29 am

Ladder49 wrote:

> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.


Strip? As in remove (or strip) the html markup, leaving only the text
content?

my $url = 'http://osuosl.org/';
print `lynx -dump $url`; # if on Linux with Lynx
# ------

http://groups.google.com/group/perl...df13319c51bf868

At the bottom of that page there is an example that uses a module to strip HTML.

http://groups.google.com/group/perl...arch+this+group

More on stripping HTML.
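
If Lynx is not available, something along these lines can strip the markup in pure Perl. Note this is an assumption on my part: the links above are truncated, so I cannot say which module those threads use. The sketch below uses HTML::TreeBuilder's as_text, and strip_html is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the text content of an HTML document with the markup removed.
# (strip_html is a hypothetical helper name.)
sub strip_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Prefer the <body> so the <title> text is left out; fall back to the root.
    my $body = $tree->look_down(_tag => 'body') || $tree;
    my $text = $body->as_text;
    $tree->delete;    # release the parse tree

    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

print strip_html('<html><body><p>plain <b>text</b> only</p></body></html>'), "\n";
```

as_text does not insert separators between block elements, so adjacent headings and paragraphs can run together. For layout closer to what `lynx -dump` prints, HTML::FormatText from CPAN may be worth a look.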

--
Alan.

Ladder49

2007-08-31, 7:27 pm

On Aug 30, 9:37 pm, dka...@nusconsulting.com.au (Daniel Kasak) wrote:
> On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:
>
> I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and
> extract stuff.
>
> --
> Daniel Kasak
> IT Developer
> NUS Consulting Group
> Level 5, 77 Pacific Highway
> North Sydney, NSW, Australia 2060
> T: (+61) 2 9922-7676 / F: (+61) 2 9922 7989
> email: dka...@nusconsulting.com.au
> website:http://www.nusconsulting.com.au


Daniel, Jeff,

Thanks for your replies. Your pointing me to LWP got me started and
now I've got a working script. Thanks again!


Copyright 2008 codecomments.com