Scraping Google for complex queries - getting CAPTCHA locked

moltar

I am not a noob at scraping, but I can't figure this one out. Google seems to have a problem with complex queries even with manual searches. Example query:

"fruits are good for you" inurl:apples -inurl:eek:ranges

Even if I go thru Firefox and manually unlock the CAPTCHA, on the next attempt it locks out again.

I even wrote a script to run thru all my proxies and unlock them, but on the next run they are all locked already.

This doesn't happen for simple queries.

Anyone figured out a way around this?

Thank you!
 


The query worked fine for me. Chances are your proxies are fubared; are you using cheap or free ones? Try a different source, or set some longer delays and thread the bitch.
 
It is definitely a combination of proxy and query. I can run that query without proxies just fine, but thru a proxy I get cock-blocked. Simple queries go thru the proxies just fine as well, even in fast succession.

I made my own proxies. They are completely private and undetectable. There can't be a better proxy than this.
 
I believe they have some kind of rate limit based on how often you run queries like that. Anything repetitive, or heavy on the advanced search operators, seems to get blocked faster. This has happened to me a few times just running manual inurl: and site: queries several times in a row.

Maybe add something to your script that searches for generic fluff after it fills in the captcha. Scrape Google Trends and run those queries in a random order with a 3-6 second random delay between them.
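
Rough sketch of what I mean, in Python and totally untested; fluff_queries, real_queries and google_search are made-up names for illustration, and the requests library is just an assumption, so sub in whatever your script already uses:

import random
import time
import requests  # assumption: plain HTTP requests; swap in your own client

# hypothetical lists for illustration; fluff_queries would come from scraping Google Trends
fluff_queries = ["how is babby formed", "weather toronto", "cheap flights to cuba"]
real_queries = ['"fruits are good for you" inurl:apples -inurl:oranges']

def google_search(query, proxy=None):
    # fire one search and return the raw HTML; parsing the results is left out of this sketch
    proxies = {"http": proxy, "https": proxy} if proxy else None
    resp = requests.get("https://www.google.com/search",
                        params={"q": query}, proxies=proxies, timeout=30)
    return resp.text

for query in real_queries:
    google_search(query)
    # pad the footprint: a couple of fluff searches in random order with 3-6 second gaps
    for fluff in random.sample(fluff_queries, 2):
        time.sleep(random.uniform(3, 6))
        google_search(fluff)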

Speaking pretty theoretically here, but I definitely think Google throttles inurl and site queries sooner than garbage like "how is babby formed"
 
Makes sense. TBH I can't even think of any whitehat applications for inurl: and the other operators in that family. :)
 
Same thing has been happening to me recently with very long queries that combine a quoted "" phrase with an inurl:.

A possible solution would be doing the requests for each page of results through different proxies simultaneously, but what's the point? I mean, the results we're looking for don't need to be instant; set a large timeout and run your script over a half-hour period.
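
Something like this is what I had in mind, purely a sketch and untested; the proxy list, page count and fetch_page are invented for illustration, and I'm assuming the Python requests library here:

import requests
from concurrent.futures import ThreadPoolExecutor

proxy_list = ["1.2.3.4:8080", "5.6.7.8:8080", "9.10.11.12:8080"]  # placeholder private proxies

def fetch_page(query, page, proxy):
    # one result page per proxy, all fired at the same time
    return requests.get(
        "https://www.google.com/search",
        params={"q": query, "start": page * 10},
        proxies={"http": "http://" + proxy, "https": "http://" + proxy},
        timeout=60,  # big timeout since we're in no hurry anyway
    ).text

query = '"fruits are good for you" inurl:apples -inurl:oranges'
with ThreadPoolExecutor(max_workers=len(proxy_list)) as pool:
    futures = [pool.submit(fetch_page, query, page, proxy)
               for page, proxy in enumerate(proxy_list)]
    pages = [f.result() for f in futures]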
 
Just out of curiosity, do you have a minimum suggested delay time?

I usually do a random 5-15 second delay on the initial search and 5-10 for paging through results. With threads, the delays become a moot point because everything runs in parallel, so you can be a bit less aggressive per proxy and still get a lot done.
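
To put those numbers in context, the pattern looks roughly like this; just a sketch, with my_proxies, query_chunks and run_query as stand-ins for whatever your own scraper already has:

import random
import time
from concurrent.futures import ThreadPoolExecutor

my_proxies = ["1.2.3.4:8080", "5.6.7.8:8080"]                # placeholder proxies
query_chunks = [['"foo" inurl:bar'], ['"baz" inurl:qux']]    # one batch of queries per proxy

def run_query(query, page, proxy):
    # stand-in for the actual request + parsing your scraper already does
    print("[%s] page %d: %s" % (proxy, page, query))

def scrape_with_proxy(proxy, queries):
    # one worker per proxy; each keeps polite delays, but they all run in parallel
    for query in queries:
        time.sleep(random.uniform(5, 15))       # 5-15s before the initial search
        run_query(query, 0, proxy)
        for page in range(1, 5):
            time.sleep(random.uniform(5, 10))   # 5-10s between result pages
            run_query(query, page, proxy)

with ThreadPoolExecutor(max_workers=len(my_proxies)) as pool:
    for proxy, chunk in zip(my_proxies, query_chunks):
        pool.submit(scrape_with_proxy, proxy, chunk)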

However, I haven't done much scraping with the inurl: modifiers, so the other guys might be right that Google throttles those searches back harder. Makes sense: they can precompute a lot of normal search results, but those kinds of queries are most likely run on the fly and harder on the Googz. Plus, most non-bot, non-spamming users won't be slamming them with inurl: searches :)

Might be worth dickin' around in Scrapebox to quickly test some theories.