Scraping Google for complex queries - getting CAPTCHA locked

moltar

I am not a noob at scraping, but I can't figure this one out. Google seems to have a problem with complex queries even with manual searches. Example query:

"fruits are good for you" inurl:apples -inurl:eek:ranges

Even if I go thru Firefox and manually unlock the CAPTCHA, on the next attempt it locks out again.

I even wrote a script to run thru all my proxies and unlock them, but on the next run they are all locked already.

This doesn't happen for simple queries.

Anyone figured out a way around this?

Thank you!
 


The query worked fine for me. Chances are your proxies are fubared; are you using cheap or free ones? Try a different source, or set some longer delays and thread the bitch.
 
It is definitely a combination of proxy and query. I can run that query without proxies just fine, but thru a proxy I get cock-blocked. Simple queries go thru the proxies just fine as well, even in fast succession.

I made my own proxies. They are completely private and undetectable. There can't be a better proxy than this.
 
I believe they have some kind of rate limit based on how often you run queries like that. Anything repetitive, or heavy on the advanced search operators, seems to get blocked faster. This has happened to me a few times just running manual inurl: and site: queries several times in a row.

Maybe add something to your script that searches for generic fluff after it fills in the captcha. Scrape Google Trends and run those queries in a random order with a 3-6 second random delay between them.
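
Rough sketch of what I mean, in Python and totally untested; fluff_queries, real_queries and google_search are made-up names for illustration, and the requests library is just an assumption, so sub in whatever your script already uses:

import random
import time
import requests  # assumption: plain HTTP requests; swap in your own client

# hypothetical lists for illustration; fluff_queries would come from scraping Google Trends
fluff_queries = ["how is babby formed", "weather toronto", "cheap flights to cuba"]
real_queries = ['"fruits are good for you" inurl:apples -inurl:oranges']

def google_search(query, proxy=None):
    # fire one search and return the raw HTML; parsing the results is left out of this sketch
    proxies = {"http": proxy, "https": proxy} if proxy else None
    resp = requests.get("https://www.google.com/search",
                        params={"q": query}, proxies=proxies, timeout=30)
    return resp.text

for query in real_queries:
    google_search(query)
    # pad the footprint: a couple of fluff searches in random order with 3-6 second gaps
    for fluff in random.sample(fluff_queries, 2):
        time.sleep(random.uniform(3, 6))
        google_search(fluff)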

Speaking pretty theoretically here, but I definitely think Google throttles inurl and site queries sooner than garbage like "how is babby formed"
 
Makes sense. TBH I can't even think of any whitehat applications for inurl: and the other operators in that family. :)
 
Same thing has been happening to me recently with very long queries that combine a quoted "" phrase with an inurl:.

A possible solution would be doing the requests for each page of results through different proxies simultaneously, but what's the point? I mean, the results we're looking for don't need to be instant; set a large timeout and run your script over a half-hour period.
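
Something like this is what I had in mind, purely a sketch and untested; the proxy list, page count and fetch_page are invented for illustration, and I'm assuming the Python requests library here:

import requests
from concurrent.futures import ThreadPoolExecutor

proxy_list = ["1.2.3.4:8080", "5.6.7.8:8080", "9.10.11.12:8080"]  # placeholder private proxies

def fetch_page(query, page, proxy):
    # one result page per proxy, all fired at the same time
    return requests.get(
        "https://www.google.com/search",
        params={"q": query, "start": page * 10},
        proxies={"http": "http://" + proxy, "https": "http://" + proxy},
        timeout=60,  # big timeout since we're in no hurry anyway
    ).text

query = '"fruits are good for you" inurl:apples -inurl:oranges'
with ThreadPoolExecutor(max_workers=len(proxy_list)) as pool:
    futures = [pool.submit(fetch_page, query, page, proxy)
               for page, proxy in enumerate(proxy_list)]
    pages = [f.result() for f in futures]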
 
Just out of curiosity, do you have a minimum suggested delay time?

I usually do a random 5-15 second delay on the initial search and 5-10 for paging through results. With threads, the delays become a moot point because everything runs in parallel, so you can be a bit less aggressive per proxy and still get a lot done.
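
To put those numbers in context, the pattern looks roughly like this; just a sketch, with my_proxies, query_chunks and run_query as stand-ins for whatever your own scraper already has:

import random
import time
from concurrent.futures import ThreadPoolExecutor

my_proxies = ["1.2.3.4:8080", "5.6.7.8:8080"]                # placeholder proxies
query_chunks = [['"foo" inurl:bar'], ['"baz" inurl:qux']]    # one batch of queries per proxy

def run_query(query, page, proxy):
    # stand-in for the actual request + parsing your scraper already does
    print("[%s] page %d: %s" % (proxy, page, query))

def scrape_with_proxy(proxy, queries):
    # one worker per proxy; each keeps polite delays, but they all run in parallel
    for query in queries:
        time.sleep(random.uniform(5, 15))       # 5-15s before the initial search
        run_query(query, 0, proxy)
        for page in range(1, 5):
            time.sleep(random.uniform(5, 10))   # 5-10s between result pages
            run_query(query, page, proxy)

with ThreadPoolExecutor(max_workers=len(my_proxies)) as pool:
    for proxy, chunk in zip(my_proxies, query_chunks):
        pool.submit(scrape_with_proxy, proxy, chunk)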

However, I haven't done much scraping with the inurl: modifiers, so the other guys might be right that Google throttles those searches back harder. Makes sense: they can precompute a lot of normal search results, but those kinds of queries are most likely run on the fly and harder on the Googz. Plus, most non-bot, non-spamming users won't be slamming them with inurl: searches :)

Might be worth dickin' around in Scrapebox to quickly test some theories.