Ways of detecting GoogleBot

tomyates · Aug 7, 2013

Hi Guys

Has anyone got any info or research on ways, complex or simple, of detecting Googlebot via PHP, JavaScript, or a combination?

I know of the standard User-Agent check (however, Google sometimes hides this or acts as a normal web user).

I know of building a list of Google IPs and identifying it that way.

What other ways have you found?

Does the google bot store cookies and sessions?

Are there any telltale signs of it in the $_SERVER variable?

Please post your thoughts

Cheers
T

Dragonfly · Aug 7, 2013

tomyates said:
Does the google bot store cookies and sessions?

The GoogleBots I run into don't store cookies. I'm not sure about sessions though. Also I noticed they never come from a referring url, just land on the page. I can only assume they crawl a site, get a list of links, then go down the list without continuous spidering, at least that's what I noticed.
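For what it's worth, those two observations (no cookies, no referrer) could be turned into a quick check along these lines. This is only a sketch: `crawlerHints` is a made-up name, it takes a `$_SERVER`-style array, and plenty of ordinary visitors also arrive with no cookie and no referrer, so the result is a hint at best, not proof.

```php
<?php
// Hypothetical helper: collects weak "might be a crawler" signals from a
// $_SERVER-style array. None of these signals is proof on its own.
function crawlerHints(array $server): array
{
    $hints = [];

    // Observation from the thread: the bot does not send cookies back.
    if (empty($server['HTTP_COOKIE'])) {
        $hints[] = 'no-cookie';
    }

    // Observation from the thread: requests arrive without a referrer.
    if (empty($server['HTTP_REFERER'])) {
        $hints[] = 'no-referer';
    }

    return $hints;
}

// Usage: $hints = crawlerHints($_SERVER);
```

A first-time human visitor typing the URL directly would trip both hints too, so this probably only makes sense combined with other checks.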

dchuk · Aug 7, 2013

Build a few honeypot sites and get them indexed and then just log all the traffic it gets, see if you can come up with some heuristics.
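A minimal sketch of the logging side of that idea, assuming one JSON line per request is enough to mine for patterns later. The function name `requestLogLine`, the chosen fields, and the log path in the usage comment are all assumptions, not anything from a real setup.

```php
<?php
// Hypothetical helper: turns one request (a $_SERVER-style array) into a
// single JSON log line that can be appended to a file and analysed later.
function requestLogLine(array $server, int $now): string
{
    return json_encode([
        'time'       => gmdate('c', $now),
        'ip'         => $server['REMOTE_ADDR'] ?? '',
        'ua'         => $server['HTTP_USER_AGENT'] ?? '',
        'referer'    => $server['HTTP_REFERER'] ?? '',
        'has_cookie' => !empty($server['HTTP_COOKIE']),
        'uri'        => $server['REQUEST_URI'] ?? '',
    ]) . "\n";
}

// Usage (path is a placeholder):
// file_put_contents('/path/to/honeypot.log',
//     requestLogLine($_SERVER, time()), FILE_APPEND | LOCK_EX);
```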

Kiopa_Matt · Aug 7, 2013

When the User-Agent HTTP header contains "Googlebot"

If you're looking to detect when a Google employee visits your site, you're not going to be able to, so don't bother trying.
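In PHP that check is roughly a one-liner. A sketch, with `uaSaysGooglebot` as a made-up name; it only tells you what the client claims to be, since anyone can send this header.

```php
<?php
// Hypothetical helper: true when the User-Agent header mentions Googlebot.
// The match is case-insensitive, and a missing header counts as "no".
function uaSaysGooglebot(array $server): bool
{
    $ua = $server['HTTP_USER_AGENT'] ?? '';
    return stripos($ua, 'Googlebot') !== false;
}

// Usage: if (uaSaysGooglebot($_SERVER)) { /* claims to be Googlebot */ }
```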

mattseh · Aug 7, 2013

Kiopa_Matt said:
When the User-Agent HTTP header contains "Googlebot"

If you're looking to detect when a Google employee visits your site, you're not going to be able to, so don't bother trying.

Sometimes the Google Bot visits with normal UA.

OP, Google Bot probably uses a heavily customized webkit, maybe it doesn't match its features perfectly with the major browsers :thumbsup:

PureSalt · Aug 7, 2013

Detecting GoogleBot and MSNBot with PHP:

At work I’m working on a project to publish a large number of subscription only pages to the web, to be indexed by Google, and hopefully drum up some interest in the website. Whilst we want Google to have full access to the documents (to index, not cache them), anybody who isn’t a search engine will have to cough up money to get the full text of the document (hence no-cache).

function isCrawler()
{
    // Either value can be missing, e.g. when the script runs from the CLI
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

    // printf("User agent is %s\n", $ua);
    // printf("Remote IP is %s\n", $ip);
    // printf("RDNS is %s\n", gethostbyaddr($ip));

    // If this doesn't look like a crawler, stop here.
    // stripos() is case-insensitive, since the capitalisation in the header can vary.
    if ($ip === '' ||
        (stripos($ua, "Googlebot") === false && stripos($ua, "MSNBot") === false)) {
        // Failed user-agent check
        return false;
    }

    // print "Possibly crawler based on user-agent string\n";

    // Check the reverse DNS (gethostbyaddr() returns false on malformed input)
    $rdns = gethostbyaddr($ip);
    if ($rdns === false ||
        (substr($rdns, -15) !== ".search.msn.com" &&
         substr($rdns, -14) !== ".googlebot.com")) {
        // Failed RDNS check
        return false;
    }

    // print "RDNS matches ($rdns)\n";

    // Check that the RDNS and FDNS match up
    // - somebody may have spoofed an IP to PTR to *.googlebot.com
    $fwd = gethostbynamel($rdns);

    // gethostbynamel() returns false when the lookup fails, so test for an
    // array before searching it. True means the forward DNS matches.
    return is_array($fwd) && in_array($ip, $fwd, true);
}

ToddW · Aug 7, 2013

^-- You covered the VERY basics.

RDNS on the fly SUCKS for user experience too.

I can tell you, after analyzing millions of IPs, that Google employees and Googlebot won't be filtered out by what you've done. You're going to do just enough that G will figure out you're doing something fishy and could penalize you for serving different content to them than to the "average" visitor.

Good luck getting someone to share this, let alone share it in a public forum.

What was posted above is about as specific as most people will get.
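On the point about on-the-fly RDNS hurting user experience: one way to soften that is to remember the verdict per IP, so the DNS lookups run once per address instead of on every request. A rough sketch under stated assumptions: `cachedCrawlerCheck`, the JSON cache file, and the one-day TTL are all made up for illustration, and `$verify` stands for whatever DNS check you already have.

```php
<?php
// Hypothetical wrapper: caches a per-IP verdict in a JSON file.
// $verify is any callable that takes an IP string and returns bool
// (for example a reverse + forward DNS check).
// $cacheFile and $ttl are assumptions; any shared cache would do.
function cachedCrawlerCheck(string $ip, callable $verify, string $cacheFile, int $ttl = 86400): bool
{
    $cache = [];
    if (is_file($cacheFile)) {
        $decoded = json_decode((string) file_get_contents($cacheFile), true);
        if (is_array($decoded)) {
            $cache = $decoded;
        }
    }

    // Reuse a verdict that has not expired yet.
    if (isset($cache[$ip]) && $cache[$ip]['at'] + $ttl > time()) {
        return (bool) $cache[$ip]['ok'];
    }

    // Otherwise run the slow check once and store the result.
    $ok = (bool) $verify($ip);
    $cache[$ip] = ['ok' => $ok, 'at' => time()];
    file_put_contents($cacheFile, json_encode($cache), LOCK_EX);

    return $ok;
}

// Usage (path is a placeholder):
// $isBot = cachedCrawlerCheck($_SERVER['REMOTE_ADDR'], 'myDnsCheck', '/tmp/bot-cache.json');
```

A flat file is the simplest thing that works for a sketch; on a busy site a shared in-memory cache would likely be a better fit.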

mattseh · Aug 7, 2013

PureSalt, please tell me that code not being indented at all is a copy and paste error.

tomyates · Aug 8, 2013

Thanks for input guys.

May also look into the cookie side of things as a further check.

eliquid · Aug 8, 2013

At first I was gonna be mad at that code posted above, then I LOL'd when I saw it wasn't what I thought.

Another day!

(O_o) · Aug 8, 2013

block entire IP ranges bruh, aint hard.

66.249.73.161 - IP in United States, Mountain View - Comments and Complaints

tomyates · Aug 8, 2013

(O_o) said:
block entire IP ranges bruh, aint hard.

66.249.73.161 - IP in United States, Mountain View - Comments and Complaints

Surely they have an infinite amount of IPs from all over the world?

ToddW · Aug 8, 2013

tomyates said:
Surely they have an infinite amount of IPs from all over the world?

I'm hoping he was joking.

tainted · Aug 10, 2013

All google IPs are publicly listed at ARIN. Scrape them and blacklist them. This has been good enough for me, and I've been running cloaked sites for over a year.

Note: Useragent check isn't good enough. I've caught googlebot faking an IE 6 useragent.

mattseh · Aug 11, 2013

tainted said:
All google IPs are publicly listed at ARIN. Scrape them and blacklist them. This has been good enough for me, and I've been running cloaked sites for over a year.

Note: Useragent check isn't good enough. I've caught googlebot faking an IE 6 useragent.

Is this disinfo? The seed of doubt is now in your mind

Ping me on skype sometime bro, we have the same timezone for now

tainted · Aug 12, 2013

It's much harder than simply scraping ARIN records for GOGL. Google has IP addresses registered to multiple corporate entities/handles. It's up to you to figure out all of them and scrape them.
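Once you have a list of ranges, matching a visitor against it is the easy part. A minimal IPv4-only sketch, assuming 64-bit PHP and ranges written in CIDR notation; `ipv4InCidr` and `ipInAnyRange` are made-up names, and the ranges in the usage comment are documentation placeholders, not real Google blocks.

```php
<?php
// Hypothetical helper: true when an IPv4 address falls inside a CIDR block.
// Assumes 64-bit PHP so the 32-bit mask fits in an int.
function ipv4InCidr(string $ip, string $cidr): bool
{
    // A bare address without "/bits" is treated as a /32.
    [$subnet, $bits] = array_pad(explode('/', $cidr, 2), 2, '32');
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    $bits = (int) $bits;

    if ($ipLong === false || $subnetLong === false || $bits < 0 || $bits > 32) {
        return false;
    }

    // Build the netmask; a /0 matches everything.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// Hypothetical helper: checks an IP against a list of CIDR blocks.
function ipInAnyRange(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ipv4InCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Usage (placeholder ranges):
// ipInAnyRange($_SERVER['REMOTE_ADDR'], ['192.0.2.0/24', '198.51.100.0/24']);
```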

The same can be said for Facebook (although they're much easier to cloak). Still figuring out Bing.

I am on Skype 24/7. Hit me up anytime matt ^^.

Panic Not · Aug 13, 2013

Statcounter used to be able to spot the bots; I haven't checked in a while whether it still does.
