I have been scraping a website in PHP via a cron job quite frequently for about two years now, without any problems so far. Yesterday, however, the mechanism broke, because the HTML I scraped with file_get_contents was suddenly empty. The basic code is this:
<?php
$url = "http://..."; // some url
$context = stream_context_create(array('http' =>
    array(
        'timeout' => 10 // for reasons this shall be short
    )
));
$html = file_get_contents($url, false, $context);
if ($html == false) {
    // some error handling
}
if (empty($html)) {
    // some error handling
}
At first the empty($html) check did not seem to match, because the
$html == false error handling went off before it. Looking into the
documentation I noticed: “This function may return Boolean FALSE, but
may also return a non-Boolean value which evaluates to FALSE”. Classic
mistake on my part: I used ==, and therefore the return value (an
empty string) evaluated to FALSE. After changing to ===, the second
error handler confirmed that the HTML read really was an empty string.
This is quite odd, because when I opened the URL in a browser I did
see the website.
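To make the difference between the two comparisons concrete, here is a minimal sketch. The helper name classifyFetchResult is hypothetical and not part of the original script; it only illustrates how a strict comparison separates "the request failed" from "the request succeeded but the body was empty".

```php
<?php
// Hypothetical helper (not from the original script): classify the value
// returned by file_get_contents() using strict comparisons only.
function classifyFetchResult($result): string
{
    if ($result === false) {
        return 'failed'; // the read itself failed
    }
    if ($result === '') {
        return 'empty';  // the read succeeded, but the body was empty
    }
    return 'ok';
}

// Loose comparison treats an empty string like FALSE, strict does not.
var_dump('' == false);   // bool(true)
var_dump('' === false);  // bool(false)

echo classifyFetchResult(''), "\n";     // empty
echo classifyFetchResult(false), "\n";  // failed
```

Note that empty() is just as loose as ==: it is also true for the string "0", so a strict check against '' is the safer test for "no content".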
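Checking the site via an HTML validator I
realized that the site suddenly lacked the <!DOCTYPE HTML SYSTEM>
declaration. I honestly do not know why. My first suspicion was that
file_get_contents refuses to read the HTML code if the doctype is
missing, but that cannot really be the cause: file_get_contents does
not parse the response at all, it simply returns whatever body the
server sends.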
I did not find any useful workaround (or even a related problem, in fact) on Google, but in the end I came up with one myself. Out of sheer desperation I tried sending a User-Agent header with my request:
<?php
$context = stream_context_create(array('http' =>
    array(
        'timeout' => 10,
        'header' => "User-Agent: Mozilla/5.0 (iPad; U;"
            . " CPU OS 3_2 like Mac OS X; en-us)"
            . " AppleWebKit/531.21.10 (KHTML, like Gecko)"
            . " Version/4.0.4 Mobile/7B334b"
            . " Safari/531.21.10\r\n"
    )
));
$html = file_get_contents($url, false, $context);
And ta-da, it worked! Not immediately, though: most of the user agents I tried were still not enough. I had to pretend to be an iPad before the web server in question agreed to serve me anything more than emptiness. Why, I do not know…
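One way to narrow down why a server answers with an empty body is to look at the status line and headers it sends back for different user agents. The following is only a sketch under assumptions: statusCodeFromHeaders and fetchWithDiagnostics are hypothetical helper names, and whether the headers reveal anything useful depends entirely on the server in question.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from a list of
// raw header lines, such as the ones PHP collects for the http:// wrapper.
function statusCodeFromHeaders(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // After redirects there are several status lines; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a URL with a given User-Agent and report the
// status code and body length instead of just the body.
function fetchWithDiagnostics(string $url, string $userAgent): array
{
    $context = stream_context_create(array('http' => array(
        'timeout'       => 10,
        'ignore_errors' => true, // still return the body on 4xx/5xx
        'header'        => "User-Agent: " . $userAgent . "\r\n",
    )));
    $body = file_get_contents($url, false, $context);

    // $http_response_header is filled in by the http:// wrapper in the
    // local scope (newer PHP versions provide
    // http_get_last_response_headers() instead).
    $headers = $http_response_header ?? array();

    return array(
        'status' => statusCodeFromHeaders($headers),
        'length' => ($body === false) ? null : strlen($body),
        'body'   => $body,
    );
}

// Usage (makes a network request, so it is left commented out):
// $result = fetchWithDiagnostics($url, 'Mozilla/5.0 (iPad; ...)');
// printf("status=%s length=%s\n", $result['status'], $result['length']);
```

Comparing the status and length for a few different user agents shows whether the server redirects, rejects, or simply sends an empty page to clients it does not recognise.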