Script for you: Scrape everything faster and easier with curl_multi and "threads"

chatmasta · Jan 7, 2007

I spent the last day and a half making this class. The goal is to make writing spiders for scraping much easier, and to modularize them so you can send out a bunch at once. I documented my code really, really well so that a year from now I am not like "WTF is this shit" when I see it, and I figured I may as well put it on here too.

tl;dr for the following: you can read most of the explanation in the comments. It makes scraping with a bunch of requests easier and faster.

I have to give credit to "win-" (Blackhat SEO – rants from the dark side of marketing) (don't know if he is still around or even what fucking name he goes by now) for the Curl class. I think he would approve of my extension.

You can read everything in the comments, but the easiest way to understand what this does is:

Imagine a "spider" with an army of "ants". The spider is this Spider class. The ants are any little scraper you want to write. Now imagine that this spider is continuously taking "steps" forward on a "chain" of methods. At each step, it asks all the ants if they need any requests carried out (such as fetching a page, posting to a page, etc). If any of them do, they tell the spider what the requests are and give it the name of their next method to send the results to.

Then the spider moves forward a step and repeats the process. Some ants might need nothing more, but other ants might still have requests. The spider continues until no ants need anything.

This way, it groups the requests of all the ants together at each step and performs them in one curl_multi execution. Processing time may not be faster for small groups of ants, just because of the extra abstraction, but for a lot of ants making a lot of requests, it is definitely faster.

I attached all the files so you can see for yourself. Spider.php contains the Spider class, a QueenAnt class that each ant must extend, and a SampleWithLogin class. SampleWithLogin uses the "CurlTesting" folder as a demonstration: it logs into a site, checks if it worked, grabs all the page numbers it needs to look at, then opens all those pages and finds any proxies on them. Make sure you change YOURURL.com to whatever.

Enjoy :bootyshake:
 

Attachments

  • Spider.zip (10.5 KB)


My b, I forgot to mention that in the OP. Obviously PHP isn't threaded, but I know I'm not the only one here who codes most of my scrapers in PHP, so you want to do the best you can. This uses curl_multi, which is not threaded, but cURL does send the requests simultaneously. The PHP processing is not threaded, but it's also not the part of the code that takes up the bulk of the time. So to answer your question: no, I am not claiming this is threaded. I am saying this is about the closest you will get to threaded with PHP. However, it's also a free script that I just spent a day working on and tossed on wickedfire, so I couldn't give a shit who thinks it is threaded or not.
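For anyone who hasn't used it, here is a minimal, self-contained sketch of the bare curl_multi pattern the Spider class wraps: every handle goes onto one multi handle, then all the transfers are driven together so they overlap instead of running one after another. None of this is from the attached class; it uses file:// URLs against temp files created on the spot so it runs without any network access.

```php
<?php
// Create three local files to stand in for three pages to fetch.
$files = array();
for ($i = 0; $i < 3; $i++) {
    $path = tempnam(sys_get_temp_dir(), 'cm');
    file_put_contents($path, "response $i");
    $files[] = $path;
}

// Add one easy handle per "page" to a single multi handle.
$mh = curl_multi_init();
$handles = array();
foreach ($files as $path) {
    $ch = curl_init('file://' . $path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until every one has finished.
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) curl_multi_select($mh); // wait for activity, don't busy-loop
} while ($running > 0);

// Collect the bodies and clean up.
$bodies = array();
foreach ($handles as $i => $ch) {
    $bodies[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    unlink($files[$i]);
}
curl_multi_close($mh);
print_r($bodies);
```

With real HTTP URLs the wall-clock win is that the slowest request bounds the step, not the sum of all of them.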
 
yeah, just wanted to get clarification. I know some people have tried to make "thread"-like stuff before, so I was just gauging.
 
Nice script, I might check it out. Honestly, I prefer Perl for scraping though, simply due to the threading issue.
 
If anyone is actually using this, I updated it to make it a lot faster. It was hogging a lot of memory before, especially if you gave it a lot of URLs. The reason was that it was holding onto so many huge CurlRequest objects. Now it just holds the results you specify in $this->addRequest() (by default it holds 'body'). Also, results are now an array, so you access them like
Code:
$results = $this->getResults();
$body = $results[0]['body']; // used to be $results->body;

Updated code:

Code:
<?php

class Spider {
    protected $curl;
    protected $ants, $next_methods, $requests, $results, $return, $keep;
    
    /* constructor */
    public function __construct(&$curl) {
        // instance of the Curl class
        $this->curl = $curl;
        
        // $this->ants[$ant_name] = instance of $ant_name
        $this->ants = array();
        
        // $this->next_methods[$ant_name] = next method to execute in $ant_name
        $this->next_methods = array();
        
        // $this->requests[$ant_name] = array( ... all upcoming requests for $ant_name ... )
        $this->requests = array();
        
        // $this->keep[$ant_name][$new_request_id] = array( ... what to keep from the result object ... )
        $this->keep = array();
        
        // $this->return[$ant_name] = last return value from $ant_name
        // this is the return value of $this->perform
        $this->return = array();
    }
    
    /*
        *    *    *     METHODS TO BE CALLED BY THE SPIDER HANDLER    *    *    *
    */
    
    /* addAnt($ant) - $ant is the name of the ant class, $proxy is array(ip,port,[username,pass]) */
    public function addAnt($ant, $proxy = null) {
        $ant = new $ant($this, $proxy);
        $ant_name = $ant->getName();
        $this->ants[$ant_name] = $ant;
    }
    
    /* perform all requests at this step. store results. then, move forward a step. if there are 
        still requests to be performed or methods to be executed, repeat the process. To visualize
        each step, uncomment the echos */
    public function perform() {
        //echo '<hr>';
        //echo 'This is a step. The following are the curl requests to be performed simultaneously:<br />';
        
        // Add all the requests from this step to the curl handle, then perform them
        // Format for request_ids: <$this_execution (unique id for this step)>:<$ant_name>:<$request_id>
        // $request_id is either the custom ID set in $this->addRequest, or it is numeric
        $this_execution = uniqid();
        foreach(array_keys($this->requests) as $ant_name) {
            foreach(array_keys($this->requests[$ant_name]) as $request_id) {
                $new_request_id = $this_execution . ':' . $ant_name . ':' . $request_id;
                // call-time pass-by-reference removed; it is deprecated and a fatal error on PHP 5.4+
                $this->curl->add($this->requests[$ant_name][$request_id]['request'], $new_request_id);
                $this->keep[$ant_name][$new_request_id] = $this->requests[$ant_name][$request_id]['keep'];

                //echo $this->requests[$ant_name][$request_id]['request']->url . '<br />';
            }
            $this->clearRequests($ant_name); // note: also clears results
        }
        $this->curl->perform();

        // Store the results of those requests (from $this->curl->requests) in $this->results.
        // Make sure the result is from this execution, and only store what we need ($keep).
        foreach(array_keys($this->curl->requests) as $result_id) {
            list($exec_number, $ant_name, $request_id) = explode(':', $result_id);
            if($exec_number == $this_execution) {
                foreach($this->keep[$ant_name][$result_id] as $keep)
                    $this->results[$ant_name][$request_id][$keep] = $this->curl->requests[$result_id]->$keep;
                unset($this->curl->requests[$result_id]);
            }
        }    
        
        // Execute the next methods for each ant
        $this->executeNextMethods();        
    
        // If, after executing the next methods (which often add new requests and further steps),
        // there are methods or requests remaining, execute function again
        if($this->methodsRemain() || $this->requestsRemain()) 
            $this->perform();
        
        // Return. This is an array containing the last returned value for each ant.    
        return $this->return;
    }
    
    
    /*
        *    *    *    METHODS TO BE CALLED BY THE ANT(S)    *    *    *
    */
    
    /* add a new request. When results are returned, the indices of the results array are the
        request IDs. By default these are numerical and assigned by the curl class. But if an ant
        sets $custom_request_id, that is the index for that result. As a rule, if you set a custom id
        for one request in a step, it is safest to set a custom id for all requests in that step. 
        also define what parts of the result to keep. defaults to body. can be array or single text */
    public function addRequest(&$request, $ant_name, $custom_request_id = null, $keep = array('body')) {
        if(!array_key_exists($ant_name, $this->requests)) $this->requests[$ant_name] = array();
        if(!is_array($keep)) $keep = array($keep);

        if(!is_null($custom_request_id))
            $this->requests[$ant_name][$custom_request_id] = array('request' => $request, 'keep' => $keep);
        else
            $this->requests[$ant_name][] = array('request' => $request, 'keep' => $keep);
    }
    
    /* pass the results from executing current requests onto the next method, $method_name() */    
    public function passToMethod($method_name, $ant_name) {
        $this->next_methods[$ant_name] = $method_name;
    }
    
    /* 
        *    *    *    INTERNAL METHODS    *    *    *
    */
    
    /* go through each ant, execute the next method. i.e. move forward a step */
    private function executeNextMethods() {
        foreach($this->next_methods as $ant_name => $next_method) {
            unset($this->next_methods[$ant_name]);
            //echo 'Executing ' . $next_method . ' on ' . $ant_name . '<br />';
            $this->return[$ant_name] = $this->ants[$ant_name]->$next_method();
        }
    }
    
    /* clear the requests and results for $ant_name */
    private function clearRequests($ant_name) {
        $this->requests[$ant_name] = array();
        $this->clearResults($ant_name); // also clear results in preparation for new ones
    }
    
    /* return true if there are any remaining requests to be performed, false otherwise */
    private function requestsRemain() {
        foreach($this->requests as $ant_name => $requests) {
            if(count($requests) > 0) return true;
        }
        return false;
    }
    
    /* return true if there are any remaining methods to be executed, false otherwise */
    private function methodsRemain() {
        return count($this->next_methods) > 0;
    }
    
    /* return $this->results for $ant_name. $this->results is array of results from the
       last executed curl requests */
    public function getResults($ant_name, $keep = false) {
        $return = $this->results[$ant_name];
        if(!$keep) $this->clearResults($ant_name);
        return $return;
    }
    
    public function clearResults($ant_name) {
        $this->results[$ant_name] = array();
    }
}

/* All ants should extend QueenAnt. This makes it easy to interact with Spider. */
class QueenAnt {
    protected $spider, $ant_name, $proxy;
    public function __construct($spider, $proxy = null) {
        $this->ant_name = get_class($this) . uniqid();
        $this->spider = $spider;
        if(is_array($proxy) && (!array_key_exists('ip', $proxy) || !array_key_exists('port', $proxy)) ) 
            $proxy = null;
        $this->proxy = $proxy;
    }
    public function getName() {
        return $this->ant_name;
    }
    public function setProxy($proxy) {
        $this->proxy = $proxy;
    }
    protected function passToMethod($method_name) {
        $this->spider->passToMethod($method_name, $this->ant_name);
    }
    protected function addRequest($request, $id = NULL) {
        if(!is_null($this->proxy)) {
            if(!array_key_exists('username', $this->proxy)) $this->proxy['username'] = null;
            if(!array_key_exists('password', $this->proxy)) $this->proxy['password'] = null;
            $request->proxy($this->proxy['ip'], $this->proxy['port'], $this->proxy['username'], $this->proxy['password']);
        }    
        if(!is_null($id))
            $this->spider->addRequest($request, $this->ant_name, $id);
        else    
            $this->spider->addRequest($request, $this->ant_name);
    }
    protected function getResults($keep = false) {
        return $this->spider->getResults($this->ant_name, $keep);
    }
    protected function error($message, $flag = null) {
        $arr = array('error' => $message, 'flag' => $flag, 'proxy' => $this->proxy, 'ant' => get_class($this), 'thread' => $this->ant_name);
        return $arr;
    }
    
    public static function str_between($start, $end, $str){
        if($start == $end) {
            $parts = explode($start, $str);
            return $parts[1];
        }
        // strpos, not $str[0], so multi-character $start delimiters match correctly
        if(strpos($str, $start) === 0) {
            $parts = explode($start, $str, 2);
            $parts = explode($end, $parts[1]);
            return $parts[0];
        }
        else {
            $parts = explode($start, $str);
            $parts = explode($end, $parts[1]);
            return $parts[0];
        }
    }
}

?>
 
Thanks for sharing. Not exactly sure how to go about playing with this thing but I will dive in and check it out.

My main issue has always been diving into categories. I end up building many php pages and a huge monster to do what I want.

I build the first level to get the categories and save them, the second level to get the subs, the third level to get the sub subs, and another to scrape the actual data and another to put it all together into an organized csv.

Then I semi-manually change the code and scrape, change the code and scrape: a custom page to iterate between all the sub-subs and save the data, then another page for the next sub-sub-subs. I also try to put some delays in the whole shebang so it isn't hammering the server too badly.

I know there's a better way and I'm hoping this will point me in the right direction. Basically what I need is code that writes itself as it learns the hierarchy and the sub-sub-category and data names.

I am not that good with php, but I eventually get it done. I appreciate the code and the ideas.
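The level-by-level scripts described above can usually collapse into one worklist loop, so you don't need a separate page per level of the hierarchy. This is only a toy sketch: the $site array and fetch() are made-up stand-ins for real pages and a real HTTP request.

```php
<?php
// Fake site map: each "page" yields either child links (deeper category
// levels) or data rows. In a real scraper you would parse these out of HTML.
$site = array(
    '/cats'     => array('links' => array('/cats/a', '/cats/b'), 'data' => array()),
    '/cats/a'   => array('links' => array('/cats/a/1'),          'data' => array()),
    '/cats/a/1' => array('links' => array(),                     'data' => array('row1', 'row2')),
    '/cats/b'   => array('links' => array(),                     'data' => array('row3')),
);

// Stand-in for a real HTTP fetch + parse.
function fetch($url, $site) { return $site[$url]; }

$queue = array('/cats'); // start at the top category
$seen  = array();
$rows  = array();
while ($queue) {
    $url = array_shift($queue);
    if (isset($seen[$url])) continue; // never fetch the same page twice
    $seen[$url] = true;

    $page = fetch($url, $site);
    foreach ($page['links'] as $link) $queue[] = $link; // descend a level
    foreach ($page['data']  as $row)  $rows[]  = $row;  // collect leaf data
    usleep(100000); // polite delay so you don't hammer the server
}
print_r($rows);
```

The same loop handles any depth of sub-sub-subs, because "learning the hierarchy" is just pushing whatever links a page yields back onto the queue.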
 
If you actually dig deep into it, just hit me up and I'd be happy to help with anything. Here's an "ant" I wrote tonight to scrape proxies from some Japanese website:

Code:
<?php
class Sakura extends QueenAnt {
    private $proxies;
    public function __construct($spider, $proxy = null) {
        parent::__construct($spider, $proxy);
        $this->proxies = array();
        $this->getPages_init();
    }
    /* init: get first page of proxies. parse: get number of pages and get the proxies on the first page */
    public function getPages_init() {
        $this->addRequest(new CurlRequest('http://proxylist.sakura.ne.jp/index.htm'));
        $this->passToMethod('getPages_parse');
    }
    public function getPages_parse() {
        $results = $this->getResults();
        $body = $results[0]['body'];

        // page nums are 0, 1, 2 ... etc. $num_pages = 6 means 0,1,2,3,4,5. We already have 0 here.
        $num_pages = substr_count($body, '<A href="index.htm?pages=') - 1;
        
        // get proxies from this page now so that we don't have to load it again
        $this->proxies = array_merge($this->proxies, self::parseProxies($body));
                
        // go to getProxies
        $this->getProxies_init($num_pages);
    }
    
    /* init: add URLs for all pages we don't have. parse: feed bodies of all pages to self::parseProxies */
    public function getProxies_init($num_pages) {
        // add the URLs & pass to parser. we already have the first page (0) so skip that
        for($i = 1; $i < $num_pages; $i++)
            $this->addRequest(new CurlRequest('http://proxylist.sakura.ne.jp/index.htm?pages=' . $i));
        $this->passToMethod('getProxies_parse');
    }
    public function getProxies_parse() {
        $results = $this->getResults();
        foreach($results as $result) {
            $body = $result['body'];
            $this->proxies = array_merge($this->proxies, self::parseProxies($body));
        }
        return $this->proxies;
    }
    
    /* parse proxies */
    private static function parseProxies($body) {
        $proxies = array();
        
        // encryption scheme is all contained in a small portion of JS at the top. there are 4 cases,
        // and when the proxy function is called, depending on the case, it shifts the IP/port parts accordingly
        // when we are done cleaning up the string for our purposes, each $cases[1,2,3,4] will look like:
        // arg2++arg3++arg4++arg1++port
        $encryption_scheme = parent::str_between('switch(mode) {', '}', $body);
        $encryption_scheme = parent::strip_whitespace($encryption_scheme);
        $encryption_scheme = str_replace(array('case1','case2','case3','case4','break;'), '', $encryption_scheme);
        list($cases[1], $cases[2], $cases[3], $cases[4]) = explode(';', $encryption_scheme);
        foreach(array_keys($cases) as $k) $cases[$k] = str_replace(array(':ret=', '"."', '":"'), '', $cases[$k]);
        
        // go through all addresses
        $addresses = explode('<TD id="address">', $body);
        array_shift($addresses);
        foreach($addresses as $address) {
            // get the proxy function and read its args. first arg is mode, rest of args are address
            if(!strstr($address, 'proxy(')) continue;
            $address = parent::str_between('proxy(', ');', $address);
            $args = explode(',', $address);
            $mode = $args[0];

            // now that we know the case, use it to build the IP string
            $ip = $cases[$mode];
            for($i = 1; $i <= 4; $i++)
                $ip = str_replace('arg' . $i, $args[$i], $ip);
            
            // clean up
            $ip = str_replace('+', '', $ip);
            $ip = str_replace("'", ".", $ip);
            $ip = str_replace('..', '.', $ip);
            $ip = str_replace('.port', '', $ip);
            $ip = trim($ip, '.');
            
            // port is the last arg
            $port = $args[5];
            
            // add to proxies array
            if(!array_key_exists($ip, $proxies)) $proxies[$ip] = $port;
        }
        
        return $proxies;
    }
}    
?>
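For anyone trying to follow parseProxies() above, here is the substitution step in isolation, on a made-up case string and made-up proxy() arguments (the real values come out of the site's JS and vary per page):

```php
<?php
// Toy walkthrough of the deobfuscation in parseProxies(). The site's JS
// reorders quoted IP fragments depending on "mode"; after cleanup a case
// string like the one below remains, and proxy()'s args drop in by position.
// All of these values are invented for illustration.
$case = "arg2++arg3++arg4++arg1++port";            // cleaned-up case string
$args = array(2, "'123'", "'45'", "'67'", "'89'"); // [mode, arg1 .. arg4]
// (the port is proxy()'s final argument and is handled separately)

$ip = $case;
for ($i = 1; $i <= 4; $i++)
    $ip = str_replace('arg' . $i, $args[$i], $ip);

// same cleanup chain as parseProxies()
$ip = str_replace('+', '', $ip);
$ip = str_replace("'", '.', $ip);
$ip = str_replace('..', '.', $ip);
$ip = str_replace('.port', '', $ip);
$ip = trim($ip, '.');

echo $ip; // prints 45.67.89.123
```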
 
What about memory usage?
I never find scraping speed to be really an issue; it just runs in the background. But (mostly because I can't be bothered to fix it) my scripts often crash due to using too much memory. I can just resume, of course, but it takes some time...
 
You just have to make sure you cycle through data quickly. Don't hold onto anything for too long, because that's when you use up the memory. Once you have the data, put it into your DB or text file or whatever, then dump it and move on. That was the problem with the original class I posted in this thread (before the updated one). It was holding onto every CurlRequest object it fetched, and those are huge objects. So now after it fetches results, it only takes the necessary data from the CurlRequest, then kills the object and only holds onto the data for the rest of the round.
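The advice above in sketch form: append each batch of parsed rows to disk as soon as you have it, then drop the references so memory stays flat across steps. The batch loop and row contents are made up; a real script would be fetching and parsing pages in that spot.

```php
<?php
// Stream results to disk per step instead of accumulating them in memory.
$out = tempnam(sys_get_temp_dir(), 'rows');
$fh = fopen($out, 'a');

for ($batch = 0; $batch < 3; $batch++) {
    // stand-in for "perform one step of requests and parse the bodies"
    $rows = array("row{$batch}a", "row{$batch}b");

    foreach ($rows as $row)
        fwrite($fh, $row . "\n");

    unset($rows); // don't hold parsed data between steps
}
fclose($fh);

echo count(file($out, FILE_IGNORE_NEW_LINES)); // prints 6
```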
 
Thank you for sharing code. It's good for the universe, and good for your soul.

I politely request that you consider getting a GitHub account and putting your code in repositories where it can be forked and maintained.
 
@Flash4Ever: thanks man haha. I used to be really terrible. I'd like to think I'm okay now.

@uplinked: I feel like a github just for me would be a little boring and pointless. Would there be any interest in getting a wickedfire github going, though? We could all share our projects on there.
 
I'm actively sharing my python projects on github, things that I'm working on every day, and I'm working with a few other WF coders to get their work published as well.

Follow me, if you wanna: http://github.com/linked :)

Everyone needs to be sharing more code.
 
This is a kickass class. I'm looking over the code now trying to digest it, as I'm just getting started coding. To use this class, don't we need the QueenAnt class, since Sakura is a child of it? I'd like to see the QueenAnt class to get a better understanding of how to write an ant. I think it would be cool for those who write ants to post them here! I will definitely do so.

Thanks for sharing the code.

 
Yeah this is a great class. I'm trying to figure it out.

You are missing a method in the QueenAnt class.

Code:
PHP Fatal error:  Call to undefined method QueenAnt::strip_whitespace()

So, do I simply need to add this to the QueenAnt class to have the correct method called?

Code:
// static, so the static parseProxies() can call it as parent::strip_whitespace();
// it needs to remove all whitespace (a plain trim wouldn't make the JS parsing work)
protected static function strip_whitespace($string) {
    return preg_replace('/\s+/', '', $string);
}