Found a good Robots.txt file for Magento

shmanamoanen, Mar 7, 2012
A while back I was having some major issues with Magento duplicate content showing up in GWT.

What happens is that the big G crawls your Magento cart and treats every URL that sorts or filters the catalog (sort order, direction, pagination, etc.) as a unique URL with duplicate content. It gets hella messy.
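For example, with Magento's stock sort/filter parameters (order, dir, mode, limit, p), one category can show up under a whole pile of URLs that all serve the same products. The category path here is made up, but the pattern looks roughly like:

/shoes.html
/shoes.html?order=price&dir=asc
/shoes.html?order=price&dir=asc&mode=list
/shoes.html?p=2&limit=36

Google treats each of those as its own page.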

I never saw any negative action toward my sites due to this very common problem, but I'm weird and I like everything to be all clean n pretty.

If you are using Magento you may find this useful (copy and paste the following into a text file, save it as robots.txt, and upload it to your FTP root):


# Google Image Crawler Setup
User-agent: Googlebot-Image
Disallow:

# Crawlers Setup
User-agent: *

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /media/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/

# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt

# Paths (no clean URLs)
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?SID=
 


PS: If any of you vets lurking around have any suggestions/additions/corrections, please don't be shy.
 
What happens is that the big G crawls your Magento cart and treats every URL that sorts or filters the catalog (sort order, direction, pagination, etc.) as a unique URL with duplicate content. It gets hella messy.

This is what the rel="canonical" tag is for...
 
Disallow: /cron.php
Disallow: /cron.sh

oh god

Would love a little expansion on this thought



This is what the rel="canonical" tag is for...

I wish it were that simple. For some reason with this CMS it's pure confusion hell with the duplicate content situation. It's very, very frustrating. I did all of the tricks, even blocked parameters in GWT, and it was still just a fuck show.

For me this has been the only thing that actually worked.
 
It's more difficult to implement, but a combination of meta robots, rel=canonical, and rel=next/rel=prev is a better solution.

Robots.txt doesn't prevent indexing of a page; it just prevents crawling. You end up with indexed pages in Google that look like this:

[screenshot: a page blocked by robots.txt still showing up in Google's index]


The better approach is:

  • rel=canonical for things like duplicate entry points to category pages, product pages, etc.
  • rel=next / rel=prev for paginated category listings
  • <meta name="robots" content="noindex"> for things that shouldn't be indexed at all, like product search results, cron.php, and other internal-use pages

Why is this better? Well, for one, if someone links to a page, rel=canonical keeps the juice. Robots.txt doesn't. It also keeps your indexed page count down, which may have some effect for things like Panda.
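To make that concrete, here's roughly what the <head> of a paginated Magento category page would carry under that setup (the URLs are made up, and the exact output depends on your theme or extension):

<link rel="canonical" href="http://www.example.com/shoes.html" />
<link rel="prev" href="http://www.example.com/shoes.html?p=1" />
<link rel="next" href="http://www.example.com/shoes.html?p=3" />

And for something like on-site search results or other pages that shouldn't be indexed at all:

<meta name="robots" content="noindex,follow" />

noindex keeps the page out of the index, while follow still lets Google crawl the links on it.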

If you aren't into hacking support for these things yourself, I assume there are relatively inexpensive 3rd party plugins.