Script for you: Scrape everything faster and easier with curl_multi and "threads"
I spent the last day and a half making this class. The goal is to make writing spiders for scraping much easier, and to modularize them so that you can send out a bunch at once. I documented my code really, really well so that a year from now I'm not like "WTF is this shit" when I see it, and I figured I may as well put it on here too.
tl;dr for the following: you can read most of the explanation in the comments. It makes scraping with a bunch of requests easier and faster.
I have to give credit to "win-" (Blackhat SEO – rants from the dark side of marketing) (don't know if he is still around or even what fucking name he goes by now) for the Curl class. I think he would approve of my extension.
You can read everything in the comments, but the easiest way to understand what this does is:
Imagine a "spider" with an army of "ants". The spider is this Spider class. The ants are any little scraper you want to write. Now imagine that this spider is continuously taking "steps" forward on a "chain" of methods. At each step, it asks all the ants if they need any requests carried out (such as fetching a page, posting to a page, etc). If any of them do, they tell the spider what the requests are and give it the name of their next method to send the results to. Then the spider moves forward a step and repeats the process. Some ants might need nothing more, but other ants might still have requests. The spider continues until no ants need anything. This way, it groups the requests of all the ants together at each step and performs them in one curl_multi execution. Processing time may not be faster for small groups of ants, just because of the extra abstraction, but for a lot of ants making a lot of requests, it is definitely faster.
I attached all the files so you can see for yourself. Spider.php contains the Spider class, a QueenAnt class that each ant must extend, and a SampleWithLogin class. SampleWithLogin uses the "CurlTesting" folder as a demonstration: it logs in to a site, checks that the login worked, grabs all the page numbers it needs to look at, then opens all those pages and finds any proxies on them. Make sure you change YOURURL.com to whatever site you're actually hitting.
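If you want a feel for what an ant looks like before opening the attachment, here's a hypothetical one following the same login-check-scrape flow as SampleWithLogin. To be clear: the method names, the return shape, and the request array format below are all assumptions for illustration only; the real conventions are whatever QueenAnt in Spider.php defines.

<?php
// Hypothetical ant using the chained-method convention described above:
// each step returns the requests it needs plus the name of the method
// the spider should send the results to.
class ProxyAnt
{
    // Step 1: ask the spider to POST the login form, then hand the
    // response to checkLogin() on the next step.
    public function start(): array
    {
        return [
            'requests' => [
                ['url'  => 'http://YOURURL.com/login.php',
                 'post' => ['user' => 'me', 'pass' => 'secret']],
            ],
            'next' => 'checkLogin',
        ];
    }

    // Step 2: verify the login worked, then queue up the listing pages.
    public function checkLogin(array $responses): array
    {
        if (strpos($responses[0], 'Logout') === false) {
            return []; // login failed -- nothing more for the spider to do
        }
        $requests = [];
        foreach (range(1, 5) as $page) {
            $requests[] = ['url' => "http://YOURURL.com/list.php?page=$page"];
        }
        return ['requests' => $requests, 'next' => 'scrapePages'];
    }

    // Step 3: pull IP:port proxies out of each page; returning nothing
    // tells the spider this ant is finished.
    public function scrapePages(array $responses): array
    {
        foreach ($responses as $html) {
            preg_match_all('/\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}/', $html, $m);
            // ... do something useful with $m[0] here ...
        }
        return [];
    }
}

Notice that start() and checkLogin() each only needed one step, while checkLogin() queued five requests that all get fetched in a single curl_multi pass on the next step. That's where the speedup comes from.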
Enjoy :bootyshake: