
Author Can a website block the use of file_get_contents ?
postseb

2008-03-28, 8:02 am

Can a website block the use of file_get_contents ?

Example : file_get_contents("http://www.google.com") works fine, but
file_get_contents("http://www.petitscailloux.com/Follow.aspx?
sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.

Any clues or ways to circumvent ?

Thanks a lot !

Jan Thomä

2008-03-28, 8:02 am

postseb wrote:
> Can a website block the use of file_get_contents ?


Usually it cannot. However, the site may set a cookie when you log in,
open a session, and so on. Since file_get_contents is not a browser
replacement, things that work in the browser may well fail when you simply
call file_get_contents. You would have to analyze the requests and responses,
look for cookies being set, session IDs etc., and then replicate them in your
PHP call. One way to do this is with fsockopen. Look at the PHP manual for
fsockopen on how to download an HTTP page with this function; there is an
example right there.
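
As a minimal sketch of replicating cookies and headers without a browser: instead of raw sockets, file_get_contents can also be given a stream context. The helper names below are hypothetical, and the cookie name and value in the usage note are only placeholders for whatever the site actually sets; this is an illustration under those assumptions, not a tested fix for the site in question.

```php
<?php
// Hypothetical helper: build stream-context options that mimic a browser request.
function build_browser_context_options(array $cookies, $userAgent)
{
    $headers = array('Accept: text/html');
    if ($cookies) {
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        // All cookies travel in a single "Cookie:" request header.
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }
    return array(
        'http' => array(
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'header'        => implode("\r\n", $headers),
            'ignore_errors' => true, // return the body even on 4xx/5xx responses
        ),
    );
}

// Hypothetical wrapper: fetch a page with the options above.
// Returns the response body, or false on failure.
function fetch_like_browser($url, array $cookies, $userAgent)
{
    $options = build_browser_context_options($cookies, $userAgent);
    return file_get_contents($url, false, stream_context_create($options));
}
```

It could be called as fetch_like_browser($url, array('ASP.NET_SessionId' => 'abc123'), 'Mozilla/5.0'), where the cookie value would be copied from a Set-Cookie header seen in an earlier response.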

Jan

--
_________________________________________________________________________
insOMnia - We never sleep...
http://www.insOMnia-hq.de

PaulB

2008-03-28, 8:02 am

"postseb" <postseb@gmail.com> wrote in message
news:2806192b-4a79-4238-9c6b-83977b270813@s50g2000hsb.googlegroups.com...
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?


http://scriptasy.com/php_11/tutorial-curl-login_44.html

function curl_login($url, $data, $proxy, $proxystatus) {
    // Start with an empty cookie file.
    $fp = fopen("cookie.txt", "w");
    fclose($fp);
    $login = curl_init();
    // Store cookies received at login and send them back on later requests.
    curl_setopt($login, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($login, CURLOPT_COOKIEFILE, "cookie.txt");
    // Use the visitor's user agent if there is one, else a generic browser string.
    $agent = isset($_SERVER['HTTP_USER_AGENT'])
        ? $_SERVER['HTTP_USER_AGENT']
        : "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    curl_setopt($login, CURLOPT_USERAGENT, $agent);
    curl_setopt($login, CURLOPT_TIMEOUT, 40);
    // Return the response as a string instead of printing it.
    curl_setopt($login, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($login, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($login, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($login, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($login, CURLOPT_URL, $url);
    curl_setopt($login, CURLOPT_HEADER, TRUE);
    curl_setopt($login, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($login, CURLOPT_POST, TRUE);
    curl_setopt($login, CURLOPT_POSTFIELDS, $data);
    $result = curl_exec($login); // execute the request
    curl_close($login);          // close the handle before returning
    return $result;
}

function curl_grab_page($site, $proxy, $proxystatus) {
    $ch = curl_init();
    // Return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    // Send the cookies saved by curl_login().
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
    $result = curl_exec($ch); // execute the request
    curl_close($ch);          // close the handle before returning
    return $result;
}

This is utterly brilliant, and got me screen scraping in no time.

Paul


C. (http://symcbean.blogspot.com/)

2008-03-28, 8:02 am

On 28 Mar, 10:03, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>



Well, it's not a valid URL for starters - you should urlencode
everything after the 'sUrl=' and lose the white space in front of it.
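
For illustration, the encoding step might look like the sketch below. build_follow_url is a hypothetical helper name, and whether the remote script actually expects an encoded value is an assumption.

```php
<?php
// Hypothetical helper: append a target URL as a properly encoded sUrl parameter.
function build_follow_url($base, $target)
{
    // trim() drops stray whitespace or line breaks left over from pasting.
    return $base . '?sUrl=' . urlencode(trim($target));
}

$url = build_follow_url(
    'http://www.petitscailloux.com/Follow.aspx',
    ' http://www.seloger.com/199986/16271207/detail.htm'
);
echo $url, "\n";
// http://www.petitscailloux.com/Follow.aspx?sUrl=http%3A%2F%2Fwww.seloger.com%2F199986%2F16271207%2Fdetail.htm
```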

If that still does not work, try using curl with a faked user agent -
maybe they serve up different content to different browsers.

But beware - if the remote site has anti-leech functionality, you
should respect the publisher's constraints.

C.

postseb

2008-03-28, 7:08 pm

>
> This is utterly brilliant, and got me screen scraping in no time.
>
> Paul



Thanks Paul and C. - I tried it with curl as well, using
curl_grab_page and an ini_set of a generic user agent, but I
got the error shown below.
Thanks also to Jan; I will have to try fsockopen too.

Runtime Error
Description: An application error occurred on the server. The current
custom error settings for this application prevent the details of the
application error from being viewed remotely (for security reasons).
It could, however, be viewed by browsers running on the local server
machine.

Details: To enable the details of this specific error message to be
viewable on remote machines, please create a <customErrors> tag within
a "web.config" configuration file located in the root directory of the
current web application. This <customErrors> tag should then have its
"mode" attribute set to "Off".

<!-- Web.Config Configuration File -->
<configuration>
<system.web>
<customErrors mode="Off"/>
</system.web>
</configuration>

Notes: The current error page you are seeing can be replaced by a
custom error page by modifying the "defaultRedirect" attribute of the
application's <customErrors> configuration tag to point to a custom
error page URL.

<!-- Web.Config Configuration File -->
<configuration>
<system.web>
<customErrors mode="RemoteOnly" defaultRedirect="mycustompage.htm"/>
</system.web>
</configuration>

petersprc

2008-03-28, 10:06 pm

Hi,

This site has user agent detection. Change your UA string to a well-known
one:

ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11');

Then you can download the page.
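
Since ini_set changes the user agent for every later HTTP stream request in the script, one option is to scope the change to a single call. The sketch below assumes PHP 5.5 or later (for finally); with_user_agent is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: run $callback with a temporary user_agent setting,
// then restore whatever value was configured before.
function with_user_agent($userAgent, $callback)
{
    $previous = (string) ini_get('user_agent');
    ini_set('user_agent', $userAgent);
    try {
        return call_user_func($callback);
    } finally {
        ini_set('user_agent', $previous);
    }
}

// Possible usage (not executed here, $url is a placeholder):
// $html = with_user_agent('Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
//     function () use ($url) { return file_get_contents($url); });
```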

Regards,

John Peters

On Mar 28, 6:03 am, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>
> Thanks a lot !


NC

2008-03-28, 10:06 pm

On Mar 28, 3:03 am, postseb <post...@gmail.com> wrote:
>
> Can a website block the use of file_get_contents ?


I've seen this happen (in particular when trying to read data off of
ASP-based Web sites), although I don't know why it happens. Either
PHP file system functions generate weird HTTP request headers or some
HTTP servers generate weird response headers...

> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?


Use cURL or write a data retrieval function using sockets:

http://groups.google.com/group/comp...1ae1757ad369ace
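
A data retrieval function over sockets might look roughly like the sketch below. It is not the code from the linked post; both function names are hypothetical, and it assumes plain HTTP on port 80. It speaks HTTP/1.0 so the server should not reply with chunked encoding.

```php
<?php
// Hypothetical helper: compose a minimal HTTP/1.0 GET request.
function build_get_request($host, $path, $userAgent)
{
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "User-Agent: $userAgent\r\n"
         . "Connection: close\r\n\r\n";
}

// Hypothetical helper: send the request over a plain socket.
// Returns array(raw headers, body), or false if the connection fails.
function socket_fetch($host, $path, $userAgent, $timeout = 30)
{
    $fp = fsockopen($host, 80, $errno, $errstr, $timeout);
    if (!$fp) {
        return false;
    }
    fwrite($fp, build_get_request($host, $path, $userAgent));
    $response = stream_get_contents($fp);
    fclose($fp);
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $response, 2);
    return array($parts[0], isset($parts[1]) ? $parts[1] : '');
}
```

Looking at the raw headers returned this way (status line, Set-Cookie, Location) is often the quickest way to see why a server treats a script differently from a browser.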

Cheers,
NC

postseb

2008-03-29, 8:02 am

@petersprc : I did indeed also try with a generic user agent, and I
managed to download the page, BUT some values on the retrieved page
were different from the values seen when simply browsing the page.
Take a look at the value to the right of "Nombre de jours": it seems to
be randomly generated when the page is retrieved, yet it is a static
value when browsing the page. How can that be? I am surprised that the
contents can be retrieved, but with random modifications of particular
values within the page.
Thank you already for your help.

@NC : yes I did try curl but got the error message mentioned above. I
will try sockets as well.
Thank you already for your help as well !

Copyright 2008 codecomments.com