Web scraping problem with PHP

2015-04-20

I have been scraping a website in PHP via a frequently running cronjob for about 2 years now and had no problems so far. Yesterday, however, the mechanism broke, because the HTML I scraped with file_get_contents was suddenly empty. The basic code is this:

<?php
$url = "http://..."; // some url

$context = stream_context_create(array('http'=>
    array(
        'timeout' => 10 // for reasons this shall be short
    )
));

$html = file_get_contents($url,false,$context);

if( $html == false ){
    // some error handling
}

if( empty($html) ){
    // some error handling
}

At first the empty($html) check did not trigger, because the $html == false error handling went off before it. Looking into the documentation I noticed: “This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE”. Classic mistake on my part: I used ==, so the return value (an empty string) evaluated to FALSE. After changing to ===, the second error handler confirmed that the HTML read really was an empty string. This is quite odd, because when I opened the URL in a browser I did see the website.
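To make the difference between == and === concrete, here is a minimal sketch. The helper name classifyFetchResult is my own invention for illustration and is not part of the original script; it only shows how a failed request can be told apart from an empty response.

```php
<?php
// Hypothetical helper (not from the original script): classify the return
// value of file_get_contents() without confusing "" with false.
function classifyFetchResult($html)
{
    if ($html === false) {
        return 'failed'; // the request itself failed (boolean false)
    }
    if ($html === '') {
        return 'empty';  // the request succeeded, but the body was empty
    }
    return 'ok';
}

var_dump('' == false);  // bool(true): loose comparison treats "" as false
var_dump('' === false); // bool(false): strict comparison also checks the type
```

With the loose comparison both cases end up in the same branch, which is exactly what hid the empty response from me.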

Checking the site with an HTML validator, I realized that it suddenly lacked the <!DOCTYPE HTML SYSTEM> header. I honestly do not know why, and at first I suspected that file_get_contents refuses to read the HTML code if the DOCTYPE is missing. That seems unlikely though, since file_get_contents does not parse the response body at all.

I did not find any useful workaround (or even a report of a related problem) on Google, but in the end I came up with one myself. Out of sheer desperation I tried sending a User-Agent header with my request:

<?php
$context = stream_context_create(array('http'=>
    array(
        'timeout' => 10,
        'header' => "User-Agent: Mozilla/5.0 (iPad; U;"
            . " CPU OS 3_2 like Mac OS X; en-us)"
            . " AppleWebKit/531.21.10 (KHTML, like Gecko)"
            . " Version/4.0.4 Mobile/7B334b"
            . " Safari/531.21.10\r\n"
    )
));

$html = file_get_contents($url,false,$context);

and tada! It worked. Not immediately though, because most user agents I tried were still not enough. I had to pretend to be an iPad before the web server in question agreed to serve me something more than just emptiness. Why, I do not know…
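One way to find out what the server actually answers is to look at the response headers. After a request through the HTTP wrapper, PHP fills the local variable $http_response_header with the raw header lines. The following is only a sketch under assumptions: the helper statusCodeFromHeaders is hypothetical, and the sample header lines are made up for illustration, not taken from the site in question.

```php
<?php
// Hypothetical helper (not part of the original script): extract the final
// HTTP status code from the header lines PHP collects for an http:// request.
function statusCodeFromHeaders(array $headers)
{
    $code = null;
    foreach ($headers as $line) {
        // Every redirect adds another status line, so the last match wins.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Intended use after a real request (left as a comment so this runs offline):
//   $html = file_get_contents($url, false, $context);
//   $code = statusCodeFromHeaders(isset($http_response_header) ? $http_response_header : array());

// Made-up header lines, for illustration only.
$sample = array('HTTP/1.1 301 Moved Permanently', 'Location: /new', 'HTTP/1.1 200 OK');
var_dump(statusCodeFromHeaders($sample)); // int(200)
```

Setting 'ignore_errors' => true in the 'http' context options makes file_get_contents return the body even for error status codes, which might show whether the server sends an explicit rejection for some user agents.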