cURL or file_get_contents to grab a page?

Nostalgae

New member
Jul 20, 2009
I'm about to play with this but I don't think I'm going to have time right now.

Currently I'm grabbing a web page using file_get_contents() and it takes a while for it to load -- would I see a quicker response if I use cURL to grab the page?

-Jason
 


Yep, cURL's a lot faster. Here's two functions that'll save you some headache.

PHP:
function do_get_request($url,  $optional_headers = null)
{
     $params = array('http' => array(
                  'method' => 'GET'
               ));
     if ($optional_headers !== null) {
        $params['http']['header'] = $optional_headers;
     }
     $ctx = stream_context_create($params);
     $fp = @fopen($url, 'rb', false, $ctx);
     if (!$fp) {
        throw new Exception("Problem with $url, $php_errormsg");
     }
     $response = @stream_get_contents($fp);
     if ($response === false) {
        throw new Exception("Problem reading data from $url, $php_errormsg");
     }
     return $response;
}

function do_post_request($url, $data, $optional_headers = null)
{
     $params = array('http' => array(
                  'method' => 'POST',
                  'content' => $data
               ));
     if ($optional_headers !== null) {
        $params['http']['header'] = $optional_headers;
     }
     $ctx = stream_context_create($params);
     $fp = @fopen($url, 'rb', false, $ctx);
     if (!$fp) {
        throw new Exception("Problem with $url, $php_errormsg");
     }
     $response = @stream_get_contents($fp);
     if ($response === false) {
        throw new Exception("Problem reading data from $url, $php_errormsg");
     }
     return $response;
}

EDIT:
Usage --
PHP:
echo do_get_request("http://google.com");
echo do_post_request("http://google.com", "a=ASDF&b=FDSA&c=FFFF");
 
Um... I don't think it's all that much faster... You can also use fopen() so you can read it byte by byte, but cURL will surely give you a heck of a lot more options, one being POST, as uplinked explained.
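Since fopen() came up: here's a rough sketch of the chunk-by-chunk fopen()/fread() approach. The function name and chunk size are my own choices.

```php
<?php
// Chunked read with fopen()/fread(), as mentioned above. Works on local
// files, and on URLs if allow_url_fopen is enabled.
function read_in_chunks($path, $chunkSize = 8192)
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new Exception("Could not open $path");
    }
    $data = '';
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $data .= $chunk; // or process each chunk here to keep memory flat
    }
    fclose($fp);
    return $data;
}
```

The win over file_get_contents() isn't speed, it's that you can process (or abort) mid-download instead of holding the whole response in memory.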
 
cURL's way faster. This is just for fun but here are two versions of my script:

Index of /wfboobs

The first one just uses this (noticeably slower):

Code:
...
$scrape = file_get_contents($url);
...

And then the cURL one uses this:

Code:
...
$scrape = scrape($url);
...
function scrape($url) {
 $scrape = curl_init();
 curl_setopt($scrape, CURLOPT_URL, $url);
 curl_setopt($scrape, CURLOPT_RETURNTRANSFER, TRUE);
 $content = curl_exec($scrape);
 curl_close($scrape);
 return $content;
}
...

You can tell why this is an important project, so.. :D
 
cURL is faster, but not by much. How do I know? Benchmarking.

Testing downloading of a single file: `http://asdf.com`

Method: remoteFileGet_curl
Description: Typical cURL file read

Iterations: 100, Total: 5.2625095844269s
Minimum: 0.044484853744507s, Maximum: 0.1283860206604s
Average: 0.051935088877775s

Method: remoteFileGet_fileGet
Description: Using file_get_contents

Iterations: 100, Total: 7.6400473117828s
Minimum: 0.065523147583008s, Maximum: 0.1490421295166s
Average: 0.075770224843706s

Method: remoteFileGet_fopen_stream
Description: Using `fopen` and `stream_get_contents`

Iterations: 100, Total: 7.3246881961823s
Minimum: 0.065551996231079s, Maximum: 0.1088080406189s
Average: 0.072962532238084s

Method: remoteFileGet_fopen_traditional
Description: Using `fopen` and `fread` in a loop.

Iterations: 100, Total: 7.5380668640137s
Minimum: 0.066048860549927s, Maximum: 0.38200902938843s
Average: 0.072347030347707s


So ya, sure it's faster, and it's worth it, especially at a large scale, but 0.02s is a small margin. The other methods aren't worth analyzing separately; they're that close. All that said, this really depends on your environment and the current system load.

Where you're really going to benefit from cURL is when you need to get more than one file at a time (scraping, etc.). In those cases, you'll want to check out the group of functions that start with `curl_multi_`.

If you want the code for this benchmark, feel free to ask.
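To give a taste of the `curl_multi_` pattern, here's an illustrative sketch (this is not my benchmark code; the function name is mine):

```php
<?php
// Fetch several URLs concurrently with curl_multi_* and return the
// response bodies keyed by URL.
function multi_fetch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers; curl_multi_select() sleeps until a socket is
    // ready instead of spinning the CPU.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status == CURLM_OK);
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

With ten slow pages, this takes roughly as long as the single slowest page rather than the sum of all ten.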
 
Consider that the site you're pulling from may not respond quickly either. Check the page's load time using one of those Firefox plugins.

If the page you're trying to grab takes 40 seconds to load in your browser, it doesn't matter what you use to pull it -- their server is only going to render the page so fast.
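You can also measure this from PHP itself with curl_getinfo() instead of a browser plugin. Rough sketch; the function name is mine:

```php
<?php
// Fetch a URL and report where the time went: connection setup vs.
// time-to-first-byte (roughly, how slow their server is) vs. total.
function fetch_with_timing($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $info = array(
        'connect' => curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        'ttfb'    => curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
        'total'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
    );
    curl_close($ch);
    return array($body, $info);
}
```

If `ttfb` dominates `total`, the bottleneck is their server, not your fetch method.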
 
curl is more configurable (forget faster -- you use curl_multi for that): custom user agents, referers, HTTPS support, binary transfers, cookies, POST, GET, whatever.

Trust me, right now there is nothing online that cannot be scraped/spidered/automated with curl.
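Most of those options are one curl_setopt_array() call away. Everything below (function name, URL, UA string, cookie file path, POST body) is a placeholder:

```php
<?php
// Build a cURL handle that looks like a browser: custom UA, referer,
// a cookie jar for session state, and a POST body.
function make_browserlike_handle($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0',
        CURLOPT_REFERER        => 'http://example.com/',  // placeholder referer
        CURLOPT_COOKIEJAR      => '/tmp/cookies.txt',     // write cookies here...
        CURLOPT_COOKIEFILE     => '/tmp/cookies.txt',     // ...and send them back
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_SSL_VERIFYPEER => false, // HTTPS without a CA bundle; insecure
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => 'a=1&b=2',              // placeholder POST body
    ));
    return $ch;
}
```

Then `$html = curl_exec(make_browserlike_handle($url));` behaves like a browser submitting a form, cookies and all.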
 
my simple curl multi class:

Code:
<?php
class curlmulti{
    public $threads = 20;
    public $timeout = 10; // seconds
    public $UA      = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.7';
    public $urls    = array();
    public $links   = array();
    public $html    = array();

    // Split the URL list into batches of $threads URLs each.
    function makeThreads(){
        $this->links = array();
        if(count($this->urls) > $this->threads){
            $i = 0; $z = 0;
            foreach($this->urls as $url){
                if( strpos($url,'http://')===false ) $url = "http://".$url;
                $this->links[$z][] = $url;
                $i++;
                if($i == $this->threads){ $i = 0; $z++; }
            }
        }
        else $this->links[0] = $this->urls;
    }

    function fetch($urls){
        $this->urls = $urls;
        $this->html = array();
        $this->makeThreads();
        foreach($this->links as $urls){
            $socketh = curl_multi_init();
            $socket  = array();
            foreach($urls as $i => $url){
                $socket[$i] = curl_init($url);
                curl_setopt($socket[$i], CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($socket[$i], CURLOPT_FOLLOWLOCATION, 1);
                curl_setopt($socket[$i], CURLOPT_USERAGENT, $this->UA);
                curl_setopt($socket[$i], CURLOPT_MAXREDIRS, 4);
                curl_setopt($socket[$i], CURLOPT_TIMEOUT, $this->timeout);
                curl_setopt($socket[$i], CURLOPT_CONNECTTIMEOUT, 5);
                curl_setopt($socket[$i], CURLOPT_SSL_VERIFYPEER, 0);
                curl_multi_add_handle($socketh, $socket[$i]);
            }
            // Wait on the sockets instead of busy-looping on curl_multi_exec().
            do {
                $x = curl_multi_exec($socketh, $working);
                if($working) curl_multi_select($socketh);
            } while( $working );
            foreach($urls as $i => $url){
                $this->html[] = curl_multi_getcontent( $socket[$i] );
                curl_multi_remove_handle($socketh, $socket[$i]);
                curl_close($socket[$i]);
            }
            curl_multi_close($socketh);
        }
        return $this->html;
    }
}
?>
fast, add setopt to flavour, abuse at will ;)