Home > Archive > PERL Beginners > August 2007 > parsing HTML content
| Author |
parsing HTML content
|
|
| Ladder49 2007-08-30, 10:20 pm |
| Is there a way to dump the HTML code for a web page? I need to write
a script which will collect and summarize content from intranet web
pages. By dump, I mean to read it the same way you would read a file
and parse its contents. Thanks.
| |
| Jeff Pang 2007-08-30, 10:20 pm |
| 2007/8/30, ladder49 <richard.ballmann@montgomerycountymd.gov>:
> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.
>
You can use LWP to do this, like so:
perl -MLWP::Simple -e '$c=get("http://www.yahoo.cn/");print $c'
See also `perldoc lwpcook`.
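For a script rather than a one-liner, something along these lines may be a reasonable starting point. It is only a sketch: fetch_page is an illustrative name (not part of LWP), and the URL in the comment is a placeholder. The one thing worth checking is that get() returns undef when the fetch fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_page is a hypothetical helper name, not something LWP provides.
# It returns the page body as one string, or dies if the fetch fails.
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);    # LWP::Simple's get() returns undef on failure
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Run as: perl fetch.pl http://intranet.example/page.html  (placeholder URL)
if (my $url = shift @ARGV) {
    my $html = fetch_page($url);
    # Walk the content line by line, much like reading a file.
    for my $line (split /\n/, $html) {
        print "$line\n";
    }
}
```

Once the page is in a scalar you can split it, match against it, or hand it to a parser, the same as any other string.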
--
Jeff Pang - rwwebs@gmail.com
http://www.readwriteweb.com/
| |
| Daniel Kasak 2007-08-30, 10:20 pm |
| On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:
> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.
I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and
extract stuff.
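Roughly, that combination could look like the sketch below. The helper name extract_summary and the choice of pulling the title plus the <h2> headings are assumptions for illustration; which elements matter depends on how the intranet pages are laid out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# extract_summary is a hypothetical helper: it takes HTML as a string and
# returns the page title followed by the text of every <h2> heading.
sub extract_summary {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In scalar context look_down returns the first matching element.
    my $title_el = $tree->look_down(_tag => 'title');
    my $title    = $title_el ? $title_el->as_text : '';

    # In list context look_down returns every matching element.
    my @headings = map { $_->as_text } $tree->look_down(_tag => 'h2');

    $tree->delete;    # release the parse tree explicitly
    return ($title, @headings);
}

# Run as: perl summary.pl http://intranet.example/page.html  (placeholder URL)
if (my $url = shift @ARGV) {
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    my ($title, @headings) = extract_summary($html);
    print "Title: $title\n";
    print "  - $_\n" for @headings;
}
```

Swapping 'h2' for another tag name, or adding more criteria to look_down (an attribute such as class, say), narrows the extraction to whatever the pages actually contain.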
--
Daniel Kasak
IT Developer
NUS Consulting Group
Level 5, 77 Pacific Highway
North Sydney, NSW, Australia 2060
T: (+61) 2 9922-7676 / F: (+61) 2 9922 7989
email: dkasak@nusconsulting.com.au
website: http://www.nusconsulting.com.au
| |
| Alan_C 2007-08-31, 4:29 am |
| Ladder49 wrote:
> Is there a way to dump the HTML code for a web page? I need to write
> a script which will collect and summarize content from intranet web
> pages. By dump, I mean to read it the same way you would read a file
> and parse its contents. Thanks.
Strip? As in remove (or strip) the html markup, leaving only the text
content?
my $url = 'http://osuosl.org/';
print `lynx -dump $url`; # if on Linux with Lynx
# ------
http://groups.google.com/group/perl...df13319c51bf868
The bottom of that page shows a module being used to strip HTML.
http://groups.google.com/group/perl...arch+this+group
More on stripping HTML.
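If Lynx is not available, a pure-Perl approximation is possible. This is a sketch only, and it assumes HTML::TreeBuilder rather than whichever module those threads use. strip_markup is a made-up name. Unlike `lynx -dump`, as_text does no layout, so text from adjacent block elements is run together.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# strip_markup is a hypothetical helper: it returns the text inside <body>
# with the tags removed. as_text leaves out <script> and <style> content.
sub strip_markup {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $body = $tree->look_down(_tag => 'body');
    my $text = $body ? $body->as_text : '';
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Run as: perl strip.pl http://osuosl.org/
if (my $url = shift @ARGV) {
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    print strip_markup($html), "\n";
}
```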
--
Alan.
| |
| Ladder49 2007-08-31, 7:27 pm |
| On Aug 30, 9:37 pm, dka...@nusconsulting.com.au (Daniel Kasak) wrote:
> On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:
>
> I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and
> extract stuff.
>
Daniel, Jeff,
Thanks for your replies. Your pointing me to LWP got me started and
now I've got a working script. Thanks again!