Ways of detecting GoogleBot

tomyates

Web Developer
Nov 25, 2010
UK, England
Hi Guys

Has anyone got any info or research on complex or simple ways of detecting Googlebot via PHP, JavaScript, or a combination of the two?

I know of the standard User-Agent check (however, Google sometimes hides this or acts as a normal web user).

I know of building a list of Google IPs and identifying it that way.

What other ways have you found?

Does Googlebot store cookies and sessions?

Are there any telltale signs of it in the $_SERVER variable?


Please post your thoughts

Cheers
T
 


Does Googlebot store cookies and sessions?

The Googlebots I run into don't store cookies. I'm not sure about sessions, though. I also noticed they never arrive with a referring URL; they just land on the page. I can only assume they crawl a site, get a list of links, then go down the list without continuous spidering. At least, that's what I've noticed.
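
The observations above (no cookies, no referrer) can be read straight out of a $_SERVER-style array. A minimal sketch, assuming a hypothetical helper named crawlerHints() that is not from this thread; these are weak hints only, since plenty of ordinary visitors also arrive without cookies or a referrer:

```php
<?php
// Hypothetical helper: collects weak crawler hints from a $_SERVER-style array.
// None of these signals proves anything; they only flag requests worth a closer look.
function crawlerHints(array $server): array
{
    $hints = [];

    // No Cookie header was sent with the request
    if (empty($server['HTTP_COOKIE'])) {
        $hints[] = 'no-cookies';
    }

    // No Referer header: the client landed directly on the page
    if (empty($server['HTTP_REFERER'])) {
        $hints[] = 'no-referer';
    }

    // The user agent claims to be Googlebot (trivially spoofable)
    if (stripos($server['HTTP_USER_AGENT'] ?? '', 'Googlebot') !== false) {
        $hints[] = 'ua-googlebot';
    }

    return $hints;
}
```

Calling crawlerHints($_SERVER) on a live request returns the list of hints that matched; what to do with them is a separate decision.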
 
When the User-Agent HTTP header contains "Googlebot"

If you're looking to detect when a Google employee visits your site, you're not going to be able to, so don't bother trying.
 
When the User-Agent HTTP header contains "Googlebot"

If you're looking to detect when a Google employee visits your site, you're not going to be able to, so don't bother trying.

Sometimes Googlebot visits with a normal UA.

OP, Googlebot probably uses a heavily customized WebKit; maybe its features don't perfectly match those of the major browsers :thumbsup:
 
Detecting GoogleBot and MSNBot with PHP:

At work I’m working on a project to publish a large number of subscription-only pages to the web, to be indexed by Google, and hopefully drum up some interest in the website. Whilst we want Google to have full access to the documents (to index, not cache them), anybody who isn’t a search engine will have to cough up money to get the full text of the document (hence no-cache).


function isCrawler()
{
    // Fall back to empty strings so a missing header cannot raise a notice
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $ip = $_SERVER['REMOTE_ADDR'] ?? '';

    // printf("User agent is %s\n", $ua);
    // printf("Remote IP is %s\n", $ip);
    // printf("RDNS is %s\n", gethostbyaddr($ip));

    // If this looks like a crawler (case-insensitive match on the user agent)
    if (
        (stripos($ua, "Googlebot") !== false) ||
        (stripos($ua, "MSNBot") !== false)
    ) {
        // print "Possibly crawler based on user-agent string\n";

        // Check the reverse DNS; gethostbyaddr() returns false on malformed input
        $rdns = gethostbyaddr($ip);
        if (
            is_string($rdns) && (
                (substr($rdns, -15) == ".search.msn.com") ||
                (substr($rdns, -14) == ".googlebot.com")
            )
        ) {
            // print "RDNS matches ($rdns)\n";

            // Check that the RDNS and FDNS match up
            // - somebody may have spoofed an IP to PTR to *.googlebot.com
            // gethostbynamel() returns false when the name does not resolve
            $fwd = gethostbynamel($rdns);
            if (is_array($fwd) && in_array($ip, $fwd, true)) {
                // print "FWD DNS matches\n";
                return true;
            } else {
                // Failed DNS -> RDNS check
                return false;
            }
        } else {
            // Failed RDNS check
            return false;
        }
    } else {
        // Failed user-agent check
        return false;
    }
}
 
^-- You covered the VERY basics.

RDNS on the fly SUCKS for user experience too.
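
One common way to soften that cost is to remember the verdict per IP so the slow lookup runs at most once per address within some time window. A minimal sketch, assuming a hypothetical cachedVerdict() wrapper; the $cache array stands in for whatever persistent store you would really use (file, database, shared memory), and the 24-hour TTL is an arbitrary assumption:

```php
<?php
// Hypothetical cache wrapper around any IP-verification callable.
// $verify would be a slow check such as a reverse/forward DNS lookup;
// it is injected so the lookup only runs when no fresh verdict is cached.
function cachedVerdict(string $ip, callable $verify, array &$cache, int $ttl = 86400, ?int $now = null): bool
{
    $now = $now ?? time();

    // Reuse a cached verdict while it is still fresh
    if (isset($cache[$ip]) && $cache[$ip]['expires'] > $now) {
        return $cache[$ip]['verdict'];
    }

    // Otherwise run the slow check once and remember the result
    $verdict = (bool) $verify($ip);
    $cache[$ip] = ['verdict' => $verdict, 'expires' => $now + $ttl];

    return $verdict;
}
```

The first request from a new IP still pays for the lookup; only repeat visits get the cached answer.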

I can tell you, after analyzing millions of IPs, that Google employees and Googlebots won't be filtered by what you've done. You're going to do just enough that G will figure out you're doing something fishy ;) and could penalize you for serving them different content than the "average" visitor.



Good luck getting someone to share this, let alone share it in a public forum.

What was posted above is about as specific as most people will get.
 
At first I was gonna be mad at that code posted above, then I LOL'd when I saw it wasn't what I thought.

Another day!
 
All Google IPs are publicly listed at ARIN. Scrape them and blacklist them. This has been good enough for me, and I've been running cloaked sites for over a year.

Note: A user-agent check isn't good enough. I've caught Googlebot faking an IE 6 user agent.
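
Matching a visitor against a list of scraped ranges comes down to a CIDR check. A minimal IPv4-only sketch, assuming 64-bit PHP and hypothetical helpers ipInCidr() and ipInAnyRange(); the ranges shown are reserved documentation blocks used as placeholders, NOT real Google ranges:

```php
<?php
// Returns true when an IPv4 address falls inside a CIDR block such as "192.0.2.0/24".
function ipInCidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false; // not in "address/bits" form
    }

    $ipLong = ip2long($ip);
    $subnetLong = ip2long($parts[0]);
    $bits = (int) $parts[1];
    if ($ipLong === false || $subnetLong === false || $bits > 32) {
        return false; // malformed address or prefix length
    }

    // Build a 32-bit netmask with the top $bits bits set
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// Returns true when the address matches any block in the list.
function ipInAnyRange(string $ip, array $cidrs): bool
{
    foreach ($cidrs as $cidr) {
        if (ipInCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Placeholder ranges only; substitute whatever list you have compiled yourself.
$ranges = ['192.0.2.0/24', '198.51.100.0/24'];
```

A call such as ipInAnyRange($_SERVER['REMOTE_ADDR'], $ranges) then answers whether the visitor is on the list; keeping that list complete and current is the hard part.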
 
All Google IPs are publicly listed at ARIN. Scrape them and blacklist them. This has been good enough for me, and I've been running cloaked sites for over a year.

Note: A user-agent check isn't good enough. I've caught Googlebot faking an IE 6 user agent.

Is this disinfo? The seed of doubt is now in your mind ;)

Ping me on skype sometime bro, we have the same timezone for now :)
 
It's much harder than simply scraping ARIN records for GOGL. Google has IP addresses registered to multiple corporate entities/handles. It's up to you to figure out all of them and scrape them ;) The same can be said for Facebook (although they're much easier to cloak). Still figuring out Bing.

I am on Skype 24/7. Hit me up anytime matt ^^.