Code Comments

web crawling program
Hi,

I am trying to run about 1000 searches on a site, using keywords from a
file, and I want to automate these searches. I have copied the site's
search form and modified its post URL so that I can make the changes
needed for the automation.

// structure of the form
<form method="post" action="http://www.XXXXXXXX.com/processform.php">
  <select></select>    // a select box with 100 options
  <input type="text">  // the input box
</form>

The program will work as follows:

While not end of file
{
    Step 1: read a keyword from the file and assign it to $value
    For each option $op on the form
    {
        // fill in the form
        Step 2: select $op
        Step 3: enter $value into the input box
        Step 4: submit

        Step 5: save the result to a file, $result
        Step 6: parse $result and save it to the database
    }
}
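
The loop above could be sketched in PHP roughly as follows. This is only an outline under stated assumptions: `searchJobs`, `runSearches`, and the two callbacks are hypothetical names, `$submit` stands in for steps 2 to 4, `$store` stands in for steps 5 and 6, and the pause between requests is an added assumption meant to keep the load on the remote site low.

```php
<?php
// Hypothetical helper: yields every (keyword, option) pair to submit.
function searchJobs(array $keywords, array $options): Generator
{
    foreach ($keywords as $value) {
        foreach ($options as $op) {
            yield [$value, $op];
        }
    }
}

// Hypothetical driver: $submit performs steps 2-4 and returns the page,
// $store performs steps 5-6. Returns the number of submissions made.
function runSearches(
    array $keywords,
    array $options,
    callable $submit,
    callable $store,
    int $delaySeconds = 10
): int {
    $count = 0;
    foreach (searchJobs($keywords, $options) as [$value, $op]) {
        if ($count > 0) {
            sleep($delaySeconds); // pause between requests (assumed value)
        }
        $store($value, $op, $submit($value, $op));
        $count++;
    }
    return $count;
}
```

The keyword list could be read with `file('keywords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)`, where the file name is again just a placeholder.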
There is no problem until step 5. When I fill in and submit the form,
the remote host opens a page that contains this table:

//table

data1-- data2 -- data3

data1 and data2 are text, and data3 is a link.

I want to save data1, data2, and the data3 link to a database.

But once control passes to the form processor on the remote host, I
have no idea how to complete step 5. I mean, how can I regain control,
save the result to a file, and parse it?
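
For the parsing part (step 6), one possible sketch uses PHP's DOM extension. It assumes the result page holds the data in ordinary `<tr>`/`<td>` rows with the link as an `<a>` element in the third cell; the real page layout is not shown in this thread, and `parseResultRows` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: extracts data1, data2 and the data3 link
// from every table row that has at least three cells.
function parseResultRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 3) {
            continue; // header or layout row
        }
        $link = $cells->item(2)->getElementsByTagName('a')->item(0);
        $rows[] = [
            'data1' => trim($cells->item(0)->textContent),
            'data2' => trim($cells->item(1)->textContent),
            'link'  => $link ? $link->getAttribute('href') : null,
        ];
    }
    return $rows;
}
```

Each returned row is a plain array, so it can be passed to whatever database insert the script already uses.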

Also, this is the only way I can think of, but it is not necessarily a
feasible solution. If you can think of other options, I am all ears :)

Thank you very much for your kind response.




raven
04-01-08 01:04 AM


Re: web crawling program
You will need to use CURL or similar to submit the form so you can get
the information back.
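
As a minimal sketch of what that could look like with PHP's cURL extension: setting `CURLOPT_RETURNTRANSFER` makes `curl_exec()` return the response body as a string instead of printing it, so the script keeps control and can save and parse the page. The helper names and the field names `city` and `keyword` used below are assumptions for illustration; the real field names have to be taken from the site's form.

```php
<?php
// Hypothetical helper: builds the cURL options for one form submission.
function buildPostOptions(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        // A URL-encoded string is sent as application/x-www-form-urlencoded;
        // passing an array instead would send multipart/form-data.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        // Return the response body from curl_exec() instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
    ];
}

// Hypothetical helper: submits the form and returns the page as a string,
// or false if the request failed.
function submitForm(string $url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildPostOptions($url, $fields));
    return curl_exec($ch);
}
```

A call such as `submitForm($url, ['city' => $op, 'keyword' => $value])` would then give a string that can be written out with `file_put_contents()` before parsing.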

And BTW - do you have permission to do this?  If I saw someone doing
this on one of my sites, they'd be blocked immediately - if not sooner.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================


Jerry Stuckle
04-01-08 01:05 AM


Re: web crawling program
Or you could just make a text file containing all the URLs and pass it
to WGET.
Or use Jerry's approach with CURL.

CURL would be my first approach since it's built-in, but you could use
WGET as well.

George Maicovschi
04-01-08 01:05 AM


Re: web crawling program
Thank you, Jerry and George, for your quick responses. The 100 options
in the form correspond to cities, and the form on the site doesn't
allow you to make a query without selecting a city first. Since I
don't know the city (it is actually what I am searching for), a
single query takes about 50 submissions on average, and I have over
100 queries. Whether it is a mistake by the remote site's designer or
me being evil :( it is unbearable to do this one by one by hand.

raven
04-01-08 01:05 AM


Re: web crawling program
On Mon, 31 Mar 2008 07:14:54 -0700 (PDT), raven <rvnsnest@gmail.com>
wrote:

>Thank you, Jerry and George, for your quick responses. The 100 options
>in the form correspond to cities, and the form on the site doesn't
>allow you to make a query without selecting a city first. Since I
>don't know the city (it is actually what I am searching for), a
>single query takes about 50 submissions on average, and I have over
>100 queries. Whether it is a mistake by the remote site's designer or
>me being evil :( it is unbearable to do this one by one by hand.


My bet is the site will figure out that you're botting them after
about the first 50 and shut you down.   Ever thought of just asking
them for the data?
--
gburnore at DataBasix dot Com
---------------------------------------------------------------------------
How you look depends on where you go.
---------------------------------------------------------------------------
Gary L. Burnore
Official .sig, Accept no substitutes.
Black Helicopter Repair Services, Ltd.

Gary L. Burnore
04-01-08 01:07 AM


Re: web crawling program
Then you have a problem.  If their webmaster is paying any attention at
all, you'll be in deep trouble with him (and most probably the site
owner, if they aren't the same people).

I wouldn't recommend it.

Jerry Stuckle
04-01-08 01:09 AM


Re: web crawling program
On Mar 31, 6:25 pm, Gary L. Burnore <gburn...@databasix.com> wrote:
> My bet is the site will figure out that you're botting them after
> about the first 50 and shut you down.   Ever thought of just asking
> them for the data?

Well, in order to go about this you should do the following things:

1. Choose a user-agent to emulate (Microsoft Internet Explorer or
Firefox)
2. Choose a random request time so that you don't send requests all
the time.

Both of these options are available in CURL and also in WGET, so you
could use either of them. You might even want to do it from more than
one IP. That's my opinion; if you need any more help on spidering the
data, just drop me an email.

Regards,
George Maicovschi.

George Maicovschi
04-01-08 01:12 AM


Re: web crawling program
Thank you all. I will add a delay of about 10 seconds between queries
so that it does not consume the remote site's bandwidth. But there is
another problem with curl.

The remote site needs credentials for access. Normally I open the
site in Firefox, enter my credentials, and then submit the form from
my local copy, and that works. But when I use curl for the post, the
site redirects me to the credentials page. I made a trivial post form
and curl handled it very well, so I don't think my mistake is in how I
use curl. Things I have tried:

1- I thought maybe the host knows the request isn't coming from an
HTML form, so I set the user agent as George suggested:
$userAgent = 'Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6';
curl_setopt($Curl_Session, CURLOPT_USERAGENT, $userAgent);
I also used the Google and Yahoo crawler IDs, as suggested elsewhere.
It didn't work.

2- I thought the web page uses cookie-based session management, so
that a direct curl submit somehow doesn't carry the session variables,
but the Web Developer add-on for Mozilla does not show any cookies.

3- I used an intermediate processor and tried to pass the posted
fields through:
curl_setopt ($Curl_Session, CURLOPT_POSTFIELDS, $_POST);
with no luck.

I don't get how a local form submit works but a curl submit does not.
I have used curl for other form completion tasks, and everywhere else
it seems OK. Is there a way the form processor can tell that the
request is coming from curl rather than from an HTML form and reject
it? Or alternatively, is there a way I can use the local form and
*somehow* gain control of the generated page?
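
One possibility consistent with item 2 is that the login state lives in a session cookie that Firefox sends automatically but a fresh cURL handle does not. This is not confirmed for the site in question. A minimal sketch, assuming the site uses an ordinary login form: the helper names, the cookie file path, and any URLs or field names passed in are hypothetical.

```php
<?php
// Hypothetical helper: options that make a cURL handle keep and resend cookies.
function sessionOptions(string $cookieFile): array
{
    return [
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here; also enables cookie handling
        CURLOPT_FOLLOWLOCATION => true,        // follow a redirect after login, if there is one
        CURLOPT_RETURNTRANSFER => true,        // return the page instead of printing it
    ];
}

// Hypothetical helper: POSTs $fields to $url on an existing handle, so that
// cookies received by earlier requests on that handle are sent along.
function postWithSession($ch, string $url, array $fields)
{
    curl_setopt_array($ch, [
        CURLOPT_URL        => $url,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ]);
    return curl_exec($ch);
}
```

The idea would be to create one handle with `curl_init()`, apply `sessionOptions()` with `curl_setopt_array()`, post the credentials to the login page first, and then post the search form on the same handle.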

raven
04-02-08 12:07 AM


Re: web crawling program
"Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
news:U_udnbJBmK8ikmzanZ2dnUVZ_uXinZ2d@comcast.com...
>
> Then you have a problem.  If their webmaster is paying any attention
> at all, you'll be in deep trouble with him (and most probably the
> site owner, if they aren't the same people).
>
> I wouldn't recommend it.
>

Jerry,
I always thought programmers were supposed to automate repetitive
tasks.
I see nothing wrong with this, as long as it's "friendly fire" ;)

R.




Richard
04-02-08 12:07 AM


Re: web crawling program
What's wrong with it is he's using someone else's information and bandwidth.

For instance, if the information is copyrighted, he could be in serious
legal trouble.  Even if it isn't copyrighted, the owner may not like the
way he's using their website.

Automating repetitive tasks is fine when it's your resources.  But when
you're using someone else's resources, you need to pay attention to what
they allow.

The whole thing could get him in serious legal trouble if he doesn't
have permission to do what he wants.  And if the owner of the site
wanted to press it, it could cost the op a LOT of money.

Jerry Stuckle
04-02-08 03:03 AM

