I've found that when scraping data from Google, they use fancy rate-limiting algorithms that ban your IP address for progressively longer intervals the more you violate their rate limits.
To avoid hitting the rate limits (or at least to make things more efficient), I rotate proxies in and out of a pool. If a proxy was working and then gets rejected by Google, I rotate it out and let it cool off for a short time. If it still gets rejected when it comes out of that cooldown, I put it back in cooldown, but for a progressively longer period with each consecutive failure. I've found that, with a decent number of proxies, this yields a respectable number of daily hits on Google.
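For anyone curious what that looks like in practice, here's a minimal Python sketch of the idea: an active pool plus a cooldown heap, where each consecutive rejection doubles the bench time up to a cap. The class name, the base cooldown of 60 seconds, the doubling factor, and the one-hour cap are all illustrative choices on my part, not tuned values.

```python
import heapq
import random
import time


class ProxyPool:
    """Rotates proxies, benching rejected ones with exponentially growing cooldowns."""

    def __init__(self, proxies, base_cooldown=60, factor=2, max_cooldown=3600):
        self.active = list(proxies)               # proxies currently usable
        self.cooling = []                         # min-heap of (ready_at, proxy)
        self.failures = {p: 0 for p in proxies}   # consecutive rejections per proxy
        self.base_cooldown = base_cooldown
        self.factor = factor
        self.max_cooldown = max_cooldown

    def _reactivate_ready(self):
        # Move proxies whose cooldown has expired back into the active pool.
        now = time.time()
        while self.cooling and self.cooling[0][0] <= now:
            _, proxy = heapq.heappop(self.cooling)
            self.active.append(proxy)

    def get(self):
        """Return a random active proxy, or None if everything is cooling off."""
        self._reactivate_ready()
        return random.choice(self.active) if self.active else None

    def report_success(self, proxy):
        # A successful request resets the proxy's failure streak.
        self.failures[proxy] = 0

    def report_rejection(self, proxy):
        # Bench the proxy; each consecutive rejection lengthens the cooldown, up to a cap.
        self.failures[proxy] += 1
        cooldown = min(
            self.base_cooldown * self.factor ** (self.failures[proxy] - 1),
            self.max_cooldown,
        )
        if proxy in self.active:
            self.active.remove(proxy)
        heapq.heappush(self.cooling, (time.time() + cooldown, proxy))
```

The nice property of the heap is that you never have to scan the benched proxies; the next one to come back is always at the top, so `get()` stays cheap even with a large pool.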
Also, rotating the user agent that I send in the headers seems to help lengthen the time until proxies start getting rejected.
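The user-agent rotation itself is trivial; here's roughly how I do it, assuming the `requests` library (any HTTP client that accepts custom headers works the same way), with a deliberately tiny example list of user-agent strings that you'd replace with a longer, up-to-date one:

```python
import random

import requests

# Illustrative user-agent pool; keep a much larger, current list in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def fetch(url, proxy):
    # Pick a fresh user agent per request so a proxy's traffic looks less uniform.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```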
Does anyone else have any tips to maximise how much they get out of their proxies before Google's rate limiting comes into effect?