

Home > Archive > PERL Beginners > November 2007 > Parse html files










 

Parse html files
I BioKid

2007-11-13, 7:59 am

Hey all,

I have around 1000 HTML files that I collected using different web crawling programs.
I need to save them and use them as part of a database.
But all the files have links to CGI programs. All these CGI links are
given as paths such as /cgi-bin/foo/foo.pl.
I don't have local copies of these programs; they live on the remote servers.
Is there any way to parse the HTML files and add the proper URL before
/cgi-bin/foo/foo.pl?

My original file:
<tr><td colspan="2">They are:</td></tr>
<tr><td>19 </td><td><a
href="/cgi-bin/lookup_public.pl?ID=10121">10121</a></td></tr>
<tr><td>19 </td><td><a
href="/cgi-bin/Name_lookup_public.pl?Name=Test12">Test12</a></td></tr>
</table><br/>

I need it as:
<tr><td colspan="2">They are:</td></tr>
<tr><td>19 </td><td><a
href="http://foo.com/cgi-bin/lookup_public.pl?ID=10121">10121</a></td></tr>
<tr><td>19 </td><td><a
href="http://foo2.com/cgi-bin/Name_lookup_public.pl?Name=Test12">Test12</a></td></tr>
</table><br/>

Can you please point me towards a module / piece of code to get this done?
--
Happy Perl Programming to all !!!

Tom Phoenix

2007-11-13, 7:00 pm

On 11/13/07, I BioKid <ibiokid@gmail.com> wrote:

> I have around 1000 HTML files that I collected using different web crawling programs.
> I need to save them and use them as part of a database.
> But all the files have links to CGI programs. All these CGI links are
> given as paths such as /cgi-bin/foo/foo.pl.
> I don't have local copies of these programs; they live on the remote servers.
> Is there any way to parse the HTML files and add the proper URL before
> /cgi-bin/foo/foo.pl?


Yes and no. Have you looked on CPAN? There are several modules
available for parsing HTML and managing URLs. But, in general, there's
no way to identify from the URL whether the server will or will not
call a CGI program. Of course, in many cases, the presence of
"cgi-bin" indicates a program is there; but that rule yields many
false positives and many false negatives. Still, if you can identify
such URLs sufficiently well for your own needs, the modules from CPAN
should take care of most of the task.

http://search.cpan.org/

Hope this helps!

--Tom Phoenix
Stonehenge Perl Training
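
For readers who want a starting point, the suggestion above could be sketched roughly as follows with two CPAN modules, HTML::Parser and URI. This is only an illustration, not code from the thread: the %base_for table (which script belongs to which site) and the absolutize_links name are assumptions made for the example, and the host names are simply the ones from the question. Only <a> tags whose href path appears in the table are rewritten; everything else is copied through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();
use HTML::Entities qw(encode_entities);
use URI ();

# ASSUMPTION: a hand-written table saying which site serves which script.
# The hosts are the ones used in the question; adjust to the real servers.
my %base_for = (
    '/cgi-bin/lookup_public.pl'      => 'http://foo.com/',
    '/cgi-bin/Name_lookup_public.pl' => 'http://foo2.com/',
);

# Hypothetical helper: returns the HTML with known CGI links made absolute.
sub absolutize_links {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Everything that is not a start tag is copied through as-is.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
        start_h   => [
            sub {
                my ($tag, $attr, $attrseq, $text) = @_;
                my $href = $tag eq 'a' ? $attr->{href} : undef;
                if (defined $href) {
                    # Look up the path without its query string or fragment.
                    (my $path = $href) =~ s/[?#].*//s;
                    if (my $base = $base_for{$path}) {
                        $attr->{href} = URI->new_abs($href, $base)->as_string;
                        # Rebuild the tag, keeping the original attribute order.
                        $text = "<$tag" . join('', map {
                            sprintf ' %s="%s"', $_,
                                encode_entities($attr->{$_}, '<>&"')
                        } @$attrseq) . '>';
                    }
                }
                $out .= $text;
            },
            'tagname, attr, attrseq, text',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Usage: perl fixlinks.pl page.html > page.fixed.html
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;    # slurp the whole file
    print absolutize_links(<$fh>);
}
```

Keying the table on the exact script path sidesteps the problem mentioned above: no attempt is made to guess whether a URL is a CGI program, and links that are not listed are left alone. A rewritten tag is re-serialised, so whitespace inside it (such as a line break between "a" and "href") is normalised to a single space.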







