I've found that when scraping data from Google, they use fancy rate-limiting algorithms that ban your IP address for progressively longer intervals the more you violate their rate limits.
To avoid hitting the rate limits (or at least to make things more efficient), I rotate proxies in and out of a pool. If a proxy was working and then gets rejected by Google, I rotate it out and let it cool off for a short time. If it still gets rejected when it comes out of that cooldown, I put it back in cooldown, but for a progressively longer period with each consecutive failure. I've found that, with a decent number of proxies, this yields a respectable number of daily hits on Google.
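For anyone curious what that looks like in practice, here's a minimal Python sketch of the idea: an active pool plus a cooldown heap, where each consecutive rejection doubles the bench time up to a cap. The class name, the base cooldown of 60 seconds, the doubling factor, and the one-hour cap are all illustrative choices on my part, not tuned values.

```python
import heapq
import random
import time


class ProxyPool:
    """Rotates proxies, benching rejected ones with exponentially growing cooldowns."""

    def __init__(self, proxies, base_cooldown=60, factor=2, max_cooldown=3600):
        self.active = list(proxies)               # proxies currently usable
        self.cooling = []                         # min-heap of (ready_at, proxy)
        self.failures = {p: 0 for p in proxies}   # consecutive rejections per proxy
        self.base_cooldown = base_cooldown
        self.factor = factor
        self.max_cooldown = max_cooldown

    def _reactivate_ready(self):
        # Move proxies whose cooldown has expired back into the active pool.
        now = time.time()
        while self.cooling and self.cooling[0][0] <= now:
            _, proxy = heapq.heappop(self.cooling)
            self.active.append(proxy)

    def get(self):
        """Return a random active proxy, or None if everything is cooling off."""
        self._reactivate_ready()
        return random.choice(self.active) if self.active else None

    def report_success(self, proxy):
        # A successful request resets the proxy's failure streak.
        self.failures[proxy] = 0

    def report_rejection(self, proxy):
        # Bench the proxy; each consecutive rejection lengthens the cooldown, up to a cap.
        self.failures[proxy] += 1
        cooldown = min(
            self.base_cooldown * self.factor ** (self.failures[proxy] - 1),
            self.max_cooldown,
        )
        if proxy in self.active:
            self.active.remove(proxy)
        heapq.heappush(self.cooling, (time.time() + cooldown, proxy))
```

The nice property of the heap is that you never have to scan the benched proxies; the next one to come back is always at the top, so `get()` stays cheap even with a large pool.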
Also, rotating the user agent that I send in the headers seems to help lengthen the time until proxies start getting rejected.
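The user-agent rotation itself is trivial; here's roughly how I do it, assuming the `requests` library (any HTTP client that accepts custom headers works the same way), with a deliberately tiny example list of user-agent strings that you'd replace with a longer, up-to-date one:

```python
import random

import requests

# Illustrative user-agent pool; keep a much larger, current list in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def fetch(url, proxy):
    # Pick a fresh user agent per request so a proxy's traffic looks less uniform.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```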
Does anyone else have any tips to maximise how much they get out of their proxies before Google's rate limiting comes into effect?