Author HELP: strange php behavior downloading html
Chuck Renner

2006-10-30, 7:04 pm

Please help!

This MIGHT even be a bug in PHP!

I'll provide version numbers and site specific information (browser, OS,
and kernel versions) if others cannot reproduce this problem.

I'm running into some PHP behavior that I do not understand in PHP 5.1.2.

I need to parse the HTML from the following carefully constructed URI:
http://crenner.smugmug.com/homepage...gallery/1960121

The problem is that when PHP downloads the HTML using file_get_contents,
or any other method of opening a remote file in PHP that I have tried,
it gives me the wrong page!

This URI is supposed to yield the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "allthumbs"
version of the page, selectable from the dropdown box at the top of the
page.

The correct page is downloaded in IE, SeaMonkey, and in wget!

But when downloading in PHP, I get the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "smugmug
small" version of the page, selectable from the dropdown box at the top
of the page.

Please note that the templatechange.mg page is merely a server-side
script that takes the arguments passed to it (TemplateID and origin),
and redirects the browser to the correct version of the page at
"origin", based on the "TemplateID".

Here is how to reproduce the problem:
* Download the page with wget so that you have a copy of the correct
results:

--commandline start here--
wget "http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121" -O correct.html
--commandline end here--

* Download the same page with php 5.1.2:

--file incorrect.php start here--
<?php
print(file_get_contents("http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"));
?>
--file incorrect.php end here--

--commandline start here--
php incorrect.php > incorrect.html
--commandline end here--

* You should now have two very different HTML files (correct.html and
incorrect.html), even though both were downloaded using the same URI!

* Open correct.html in a web browser. You will see a thumbnails-only
("allthumbs") version of a smugmug.com picture gallery.

* Open incorrect.html in a web browser. You will see a paginated
version of the same smugmug.com picture gallery ("smugmug small"), with
a larger image on the right.

I know that I could make a workaround by having my PHP scripts call wget
instead of using intrinsic functions to download the HTML. This is not
practical for me for a number of reasons, including code portability and
streamlining.

Can anyone help me with this? I know that templatechange.mg uses a 302
to redirect the browser, based on the output I get from wget. I also
know that PHP is following the redirect (even if it is doing so
incorrectly), because I'm not getting the contents of
templatechange.mg itself, but a different version of the gallery.
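
One way to see what happens on each hop of the redirect is to look at the response headers PHP collected. The following is a minimal sketch, not part of the original script: summarize_headers() is a hypothetical helper name, and it assumes it is handed an array of raw header lines such as the $http_response_header variable that PHP fills in after a call to file_get_contents() on an http:// URI.

```php
<?php
// Hypothetical helper (not from the original post): group raw HTTP header
// lines into one entry per response, recording the status code, any
// Location header, and any Set-Cookie headers seen on that response.
function summarize_headers(array $headerLines) {
    $summary = array();
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A status line starts a new response (one hop of the chain).
            $summary[] = array(
                'status'   => (int) $m[1],
                'location' => null,
                'cookies'  => array(),
            );
        } elseif ($summary && preg_match('/^(Location|Set-Cookie):\s*(.*)$/i', $line, $m)) {
            $last = count($summary) - 1;
            if (strcasecmp($m[1], 'Location') === 0) {
                $summary[$last]['location'] = $m[2];
            } else {
                $summary[$last]['cookies'][] = $m[2];
            }
        }
    }
    return $summary;
}
```

After file_get_contents($uri), calling print_r(summarize_headers($http_response_header)) should list the 302 response and the final response separately, which makes it easy to spot a redirect that sets a cookie.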

This is driving me crazy. I can find no logical reason why PHP would
yield different results for the same URI than I get in three other
HTTP clients (SeaMonkey, IE, and wget).

I have also attached the results pages and the php script (correct.html,
incorrect.html, and incorrect.php) in php_download_strangeness.tar.bz2
(a bzip2 compressed tar archive)

- Chuck Renner



Chuck Renner

2006-10-30, 7:04 pm

Thanks Rik for pointing out that the HTTP headers on that redirected
page were setting and using cookies and for pointing me in the right
direction with cURL.
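
For anyone who cannot use cURL, here is a rough sketch of the same idea with PHP's own HTTP stream wrapper: follow the redirects by hand and send back whatever cookies each response set. This is an illustration under assumptions, not what I ended up using. The function names cookie_header_from() and fetch_with_cookies() are made up for this example, the follow_location context option needs a newer PHP than 5.1.2, the Location header is assumed to be an absolute URI, and cookie domain, path, and expiry rules are ignored entirely.

```php
<?php
// Hypothetical helper: build a Cookie request header value from raw
// Set-Cookie values, keeping only the leading name=value pair of each.
// A later cookie with the same name replaces the earlier one.
function cookie_header_from(array $setCookieValues) {
    $pairs = array();
    foreach ($setCookieValues as $value) {
        $parts = explode(';', $value, 2);
        $first = trim($parts[0]);
        if (strpos($first, '=') === false) {
            continue; // not a name=value pair, skip it
        }
        list($name) = explode('=', $first, 2);
        $pairs[trim($name)] = $first;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetch $uri, following redirects manually and
// replaying cookies set along the way. Returns the body or false.
function fetch_with_cookies($uri, $maxHops = 5) {
    $cookies = array();
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        // Turn off automatic redirects so each response can be inspected.
        $http = array('follow_location' => 0);
        if ($cookies) {
            $http['header'] = 'Cookie: ' . cookie_header_from($cookies);
        }
        $context = stream_context_create(array('http' => $http));
        $body = file_get_contents($uri, false, $context);
        if ($body === false) {
            return false;
        }
        $location = null;
        // $http_response_header is filled in by the HTTP wrapper.
        foreach ($http_response_header as $line) {
            if (stripos($line, 'Set-Cookie:') === 0) {
                $cookies[] = trim(substr($line, 11));
            } elseif (stripos($line, 'Location:') === 0) {
                $location = trim(substr($line, 9));
            }
        }
        if ($location === null) {
            return $body; // no further redirect: this is the final page
        }
        $uri = $location;
    }
    return false; // too many redirects
}
```

Calling fetch_with_cookies() with the templatechange.mg URI should then behave more like wget does, provided the site's redirect really depends only on the cookie it sets.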

I was able to get a correctly working solution to my HTML downloading
problem in less than an hour, using cURL with PHP.

With the function I have below, I just call tempnam() to give me a
temporary filename, call my function with the uri and the results from
tempnam(), and then read the file with file_get_contents(). I then can
delete the file with unlink().
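
Those steps can be wrapped up in one small helper. This is only a sketch: fetch_via_temp_file() is a name invented for this example, and it assumes the downloader is any callable that takes a URI and a file name and writes the downloaded content into that file.

```php
<?php
// Hypothetical wrapper: download $uri into a temporary file using the
// given downloader callable, read the file back, and clean up.
function fetch_via_temp_file($uri, $download) {
    // tempnam() creates an empty, uniquely named file and returns its path.
    $fileName = tempnam(sys_get_temp_dir(), 'uri');
    if ($fileName === false) {
        return false;
    }
    // The downloader is expected to write the response body to $fileName.
    call_user_func($download, $uri, $fileName);
    $contents = file_get_contents($fileName);
    // Remove the temporary file whether or not the read succeeded.
    unlink($fileName);
    return $contents;
}
```

With a download function that has the signature (uri, file name), the call looks like fetch_via_temp_file($uri, 'uri_download').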

Here is the function I wrote to download a uri into a file (following
all redirects, ignoring old cookies, and passing set cookies to redirects):
<?php
function uri_download($uri, $fileName) {
    // use cURL to download uri
    // make a curl resource, setting the uri as its target to open
    $curl = curl_init($uri);
    // make a file resource and create/empty the file for writing
    $hFile = fopen($fileName, "w+");
    if ($hFile === false) {
        return false;
    }
    // set curl options
    // set the file resource that curl will write to
    curl_setopt($curl, CURLOPT_FILE, $hFile);
    // do not let curl output the HTTP headers
    curl_setopt($curl, CURLOPT_HEADER, false);
    // let curl follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // turn on curl's cookie handling; the empty string means no cookie
    // file is read, so cookies are kept in memory for this handle only
    curl_setopt($curl, CURLOPT_COOKIEFILE, "");
    // tell curl to mark this as a new cookie session
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    // execute curl (download the uri to the temp file)
    $ok = curl_exec($curl);
    // close the curl resource
    curl_close($curl);
    // close the temp file and file resource
    fclose($hFile);
    // true on success, false if the download failed
    return $ok;
}
?>
