Question for Experienced Scrapers

Status
Not open for further replies.

dogfighter

Irish Prick
May 21, 2007
I've done my fair share of scraping since Bofu started imparting some of his PHP knowledge on my dumb ass. However, I've never had to scrape a site like Google that will hit you with a captcha if your behavior does not seem... human.

How do you deal with that? I'm already spoofing my user agent but I know that's not enough. I could add some random sleep() times into the script if that would help. It would slow things down but I can live with that. What I really don't want to have to do is use a proxy for each new page.

And if it matters, I'm using file_get_contents, not cURL.
 


I've done my fair share of scraping since Bofu started imparting some of his PHP knowledge on my dumb ass.

He's the master

How do you deal with that? I'm already spoofing my user agent but I know that's not enough. ... And if it matters, I'm using file_get_contents, not cURL.

I didn't even know you could spoof the user agent with file_get_contents until I just did a search; turns out you can do an ini_set() for it ... huh, learn something new every day.

I typically use cURL and set the referrer to the Google homepage ... that I'm pretty sure you can't do with file_get_contents
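For what it's worth, file_get_contents can send a Referer header too if you hand it a stream context. A minimal sketch, assuming PHP 5.4 or later; the helper name build_context_options, the user agent string, and the example.com URLs are placeholders for illustration, not anything from this thread:

```php
<?php
// Hypothetical helper: builds the options array for stream_context_create().
function build_context_options($userAgent, $referer)
{
    return array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            // Extra request headers; each header line ends with CRLF.
            'header'     => "Referer: " . $referer . "\r\n",
            'timeout'    => 10, // seconds
        ),
    );
}

$options = build_context_options(
    'Mozilla/5.0 (compatible; ExampleBot/0.1)', // placeholder user agent
    'https://www.example.com/'                   // placeholder referer
);
$context = stream_context_create($options);

// The fetch itself is commented out so the sketch runs offline:
// $html = file_get_contents('https://www.example.com/search?q=test', false, $context);
```

The third argument to file_get_contents is the context, so the same call that already works in an existing script only needs that one extra parameter.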

As far as getting around your problem without proxies ... just get the script to dump to a text file or DB and set a cron job to run it every 3-5 minutes, and you'll never get banned unless you're doing some off-the-wall queries ... in which case, I'd suggest you scrape one of Google's sister sites instead of Google herself.
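The cron approach could look roughly like the sketch below: each run takes a few URLs off a queue file, appends what it got to a text file, and leaves the rest for the next run. The function name run_batch, the file layout, and the idea of passing the fetcher in as a callable are all assumptions made for the example.

```php
<?php
// Hypothetical batch runner meant to be started by cron every few minutes.
// $fetch is any callable that takes a URL and returns the page body, or false on failure.
function run_batch($queueFile, $outFile, $batchSize, $fetch)
{
    $urls = file_exists($queueFile)
        ? file($queueFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();

    $batch  = array_slice($urls, 0, $batchSize);
    $rest   = array_slice($urls, $batchSize);
    $failed = array();

    foreach ($batch as $url) {
        $body = call_user_func($fetch, $url);
        if ($body === false) {
            $failed[] = $url; // retry on a later run
            continue;
        }
        // One line per result: URL, a tab, the body length. Store whatever you actually need.
        file_put_contents($outFile, $url . "\t" . strlen($body) . "\n", FILE_APPEND | LOCK_EX);
    }

    // Write back what is left so the next cron run picks up where this one stopped.
    $remaining = array_merge($rest, $failed);
    file_put_contents($queueFile, $remaining ? implode("\n", $remaining) . "\n" : '', LOCK_EX);

    return count($batch) - count($failed); // number of pages saved this run
}
```

A crontab entry along the lines of `*/5 * * * * php /path/to/scrape.php` (path is a placeholder) would then call run_batch with a small batch size, so the request rate stays low without any sleeping inside the script.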
 
You make an interesting point there. I have an array of URLs that I increment in a do/while loop with $i++... maybe I need to set my referrer to --$i. Thanks bro.

Open to any other suggestions as well.
 
I haven't scraped Google in a while, but I never had any problems when I cURLed the Google home page, submitted my search, scraped the results, and threw some random usleep() timers in there. When your page requests come faster than a human could possibly make them, you just look like a bot.
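The random timers mentioned above might be sketched like this. The function names and the idea of passing the fetcher in as a callable are made up for the example, and the delay range is something you would have to tune yourself; nothing here is a known-safe value.

```php
<?php
// Hypothetical helper: picks a pause between $minSeconds and $maxSeconds, in microseconds.
function random_delay_us($minSeconds, $maxSeconds)
{
    return mt_rand((int) ($minSeconds * 1000000), (int) ($maxSeconds * 1000000));
}

// Sleeps for $us microseconds. Whole seconds go through sleep() because
// usleep() with values above one second is not supported on every system.
function pause_us($us)
{
    $seconds = (int) floor($us / 1000000);
    if ($seconds > 0) {
        sleep($seconds);
    }
    usleep($us % 1000000);
}

// Fetches each URL in turn, pausing a random amount of time between requests.
// $fetch is any callable that takes a URL and returns the page body.
function fetch_politely(array $urls, $fetch, $minSeconds, $maxSeconds)
{
    $pages = array();
    $last  = count($urls) - 1;
    foreach (array_values($urls) as $i => $url) {
        $pages[$url] = call_user_func($fetch, $url);
        if ($i < $last) {
            pause_us(random_delay_us($minSeconds, $maxSeconds)); // no pause after the final request
        }
    }
    return $pages;
}
```

Calling something like fetch_politely($urls, 'file_get_contents', 2, 6) would wait between two and six seconds between pages, which is slower but much closer to how a person clicks through results.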
 