Coding a Python rank tracker

iamjon

Member
Aug 11, 2010
Up until now I've been using the Market Samurai Rank Tracker and it's been good enough for what I've needed, but with all the bullshit that's happened recently, I think I'd prefer to look after rank tracking myself.

I'm normally a PHP developer (WordPress for simple stuff and CodeIgniter for more complicated things), but I've been having a play with Python recently and I like what I see.

I know this won't be the quickest way to write this but it's a learning project too.

I've been playing around with the Scrapy framework and have written a few little scripts and scrapers, but I think a rank tracker could be a decent project to really get my teeth into Python with.

For those of you with any relevant experience, I'd appreciate any pointers that might be useful.

I'm only going to bother with checking the top 100 results in Google for now, as this should limit the need for proxies, at least to start with. For the moment I'm mainly concerned with getting the data into a database that I can run manual queries against; I'll probably build out some sort of front-end at some point.

I'm interested in running the following kinds of reports/queries for each domain:

- 24 hour rank movement
- 7 day rank movement
- 30 day rank movement
- New domains entering top 100 in last 7/30 days for competing keywords

I've developed plenty of databases in the past, but nothing holding more than a few thousand records, so I'm wondering if this is the right way to go about setting up the database (rough sketch after the field lists below):

Domains table: Stores minimal info about each domain being tracked; it's mainly there for an eventual front-end to the database so that I can return all rankings for every keyword associated with a domain.

Following fields:

- domain_id
- domain_name

Keywords table: Stores all keywords being tracked and the relevant country for local tracking. Used to populate the list of keywords that need to be scraped.

Following fields:

- keyword_id
- keyword_name
- keyword_locale

DomainKeywords table: Many-to-many link table between the above two, used for running reports/queries.

Following fields:

- domainkeyword_id
- domain_id
- keyword_id

Serps table: This is where all the ranking data will go, possibly with the following fields:

- serp_id: Primary key
- keyword_id: Foreign key to the keywords table
- serp_rank: Position in the SERPs
- serp_datetime: Date/time that the result was pulled
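
Here's roughly what that looks like, sketched with sqlite3 purely to illustrate the structure (I'd use MySQL for the real thing, and none of this is tested):

```python
# Rough sketch of the proposed schema, using sqlite3 purely to illustrate the
# structure (I'd use MySQL for the real thing). Table/field names mirror the
# lists above; "domain_keywords" is the DomainKeywords link table.
import sqlite3

conn = sqlite3.connect("ranktracker.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS domains (
    domain_id    INTEGER PRIMARY KEY,
    domain_name  TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS keywords (
    keyword_id      INTEGER PRIMARY KEY,
    keyword_name    TEXT NOT NULL,
    keyword_locale  TEXT NOT NULL            -- country/Google TLD for local tracking
);

CREATE TABLE IF NOT EXISTS domain_keywords (
    domainkeyword_id  INTEGER PRIMARY KEY,
    domain_id         INTEGER NOT NULL REFERENCES domains (domain_id),
    keyword_id        INTEGER NOT NULL REFERENCES keywords (keyword_id)
);

CREATE TABLE IF NOT EXISTS serps (
    serp_id        INTEGER PRIMARY KEY,
    keyword_id     INTEGER NOT NULL REFERENCES keywords (keyword_id),
    serp_rank      INTEGER NOT NULL,         -- position in the SERPs
    serp_datetime  TEXT NOT NULL             -- when the result was pulled (ISO format)
);
""")
conn.commit()
```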

I was thinking that I'd like to track data hourly, but I'm not sure whether I'll end up with too much data if I'm tracking a lot of keywords.

Do you think this will be an issue?

I was thinking that if I only check the rankings every 12 hours, I can store more data for competitor sites, which would be useful for running automated reports to identify new competitors and track the movement of existing ones.
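
For reference, the kind of query I'm picturing for the "24 hour rank movement" report is sketched below. It's untested, it runs against the schema above, and it assumes (for simplicity) one serps row per keyword per check holding my own domain's rank, with serp_datetime stored as an ISO string:

```python
# Untested sketch of a "24 hour rank movement" report against the schema above.
# Assumes one serps row per keyword per check (my own domain's rank) and that
# serp_datetime is stored as an ISO string so text comparison sorts correctly.
import sqlite3

conn = sqlite3.connect("ranktracker.db")
rows = conn.execute("""
    SELECT k.keyword_name,
           latest.serp_rank                      AS rank_now,
           earlier.serp_rank                     AS rank_24h_ago,
           earlier.serp_rank - latest.serp_rank  AS movement   -- positive = moved up
    FROM keywords k
    JOIN serps latest
      ON latest.keyword_id = k.keyword_id
     AND latest.serp_datetime = (SELECT MAX(serp_datetime)
                                   FROM serps
                                  WHERE keyword_id = k.keyword_id)
    JOIN serps earlier
      ON earlier.keyword_id = k.keyword_id
     AND earlier.serp_datetime = (SELECT MAX(serp_datetime)
                                    FROM serps
                                   WHERE keyword_id = k.keyword_id
                                     AND serp_datetime <= datetime('now', '-1 day'))
""").fetchall()

for keyword, rank_now, rank_then, movement in rows:
    print(f"{keyword}: {rank_then} -> {rank_now} ({movement:+d})")
```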

I know this part of the forum is pretty quiet so hopefully this thread might develop into something useful for others as well.

Any other tips you might have?

- Problems I might run into scraping Google compared to sites that aren't prepared for scraping
- Recommended libraries


Thanks for reading and for any advice



TL;DR: Looking for advice on coding a personal rank tracker using Python to replace Market Samurai.
 


I definitely don't recommend hourly tracking; there's little practical benefit to it, and you'll increase your scraping resources and expenses by 24x compared to daily tracking.

Stick with daily tracking, and even then you're gonna need a good amount of proxies to effectively scrape Google.

Mattseh is a great Python developer, so try getting in touch with him.
 
Thanks Falian

With the recommendation about proxies, are you suggesting that the challenge would be in preventing Google from recognising the automated queries, or more about standardising rankings across multiple IPs? Or both?

I was assuming that even with as few as 5 proxies, each making 1 request per minute, I could scrape the top 100 results for 300 keywords every hour.
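
Roughly what I had in mind is sketched below; it's completely untested, the proxy addresses are placeholders, and I haven't thought about parsing the HTML or handling blocks/CAPTCHAs yet:

```python
# Untested sketch of the pacing I described: rotate through a handful of
# proxies, keep each one to roughly 1 request per minute, and pull 100 results
# per query with &num=100. Proxy addresses are placeholders; parsing and
# block/CAPTCHA handling still to do.
import time
from itertools import cycle

import requests

PROXIES = [  # placeholders
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
    "http://user:pass@proxy4.example.com:8080",
    "http://user:pass@proxy5.example.com:8080",
]

def fetch_serp(keyword, proxy):
    """Fetch the top 100 Google results for one keyword through one proxy."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": keyword, "num": 100},
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text  # raw HTML, to be parsed and stored separately

def scrape(keywords):
    proxy_pool = cycle(PROXIES)
    for keyword in keywords:
        html = fetch_serp(keyword, next(proxy_pool))
        # TODO: parse the HTML and write the rankings to the database
        time.sleep(60 / len(PROXIES))  # each proxy ends up making ~1 request/minute
```

That works out at 300 keywords an hour with 5 proxies, which is where my numbers came from.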

Is there something I'm missing here or do you think that would be feasible?

Thanks again
 
Yes, I recommend proxies because Google will recognize the automated queries. One request per minute may be too much, but only testing will give you that answer.

Once again, scraping every hour isn't practical. Your data storage is going to be 24x larger and you're going to scrape 24x more often. There's no real benefit other than being able to refresh your stats every hour.
 
@Falian: Yeah thanks, I guess you're right about finding out what I can get away with when I start testing. I've totally dropped the idea of hourly scraping now and agree that most of the hourly data would be completely redundant anyway.

@mattseh: Cheers for the offer, but I think the $20k tracker might be a bit out of my league if I'm looking to replace a shitty £50 tool and learn some Python.

Thanks for the denormalisation tip. I've only ever really learned MySQL and never looked back or considered an alternative, so this sounds pretty cool.

Out of interest, is this something you've adopted as standard for your own projects, or does it still totally depend on the data being stored?

Cheers

If anyone's reading this and hasn't come across MongoDB before, the following was really helpful:

MongoDB vs. RDBMS Schema Design
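
If I've understood the denormalised approach right, a single check could end up as one document per keyword with the whole top 100 embedded, something like the pymongo sketch below (completely untested; names are placeholders and it assumes a local mongod):

```python
# Untested sketch of a denormalised SERP check: one document per keyword per
# check with the whole top 100 embedded, rather than one row per result.
# Assumes pymongo and a local mongod; all names are placeholders.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
checks = client["ranktracker"]["serp_checks"]

checks.insert_one({
    "keyword": "example keyword",
    "locale": "google.co.uk",                 # placeholder locale
    "checked_at": datetime.now(timezone.utc),
    "results": [                              # full top 100 embedded in the document
        {"rank": 1, "domain": "competitor-one.example"},
        {"rank": 2, "domain": "competitor-two.example"},
        # ... down to rank 100
    ],
})
```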
 
If you're just tracking your own domains and you are only storing their rank at each check, you will be completely ok with MySQL. It can handle millions of records totally fine.

Then again if it's a learning project and you want to learn MongoDB, there's no reason not to integrate it other than the learning curve. Personally I would keep at least one part of the project something I'm familiar with (i.e. MySQL in your case) and make the rest the learning aspect of it.
 
I also wouldn't bother with proxies if it's just one run per day on several keywords.

If you're doing serious bulk then that's another story.

Oh yeah, uBot.
 
Look into Django. It's maybe overkill for this, but once you have your rank tracker you'll probably want to start adding more tools, and it's good shit to learn.
 
 
Don't buy the RoR hype, it's mostly marketing :)

Also, I think your math is off there: 100 results / 10 per page = 10 requests per keyword * 300 keywords = 3,000 requests an hour; at 1 request per minute per proxy (60 an hour), that's 50 proxies needed.

If I was doing this I would separate the scraping part from the analysis. Just dump the SERPs into whatever and then run your queries on that instead of putting too much thought into how you want to organize it.
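
Something as dumb as the sketch below would do for the "dump it into whatever" part; the file format and names are arbitrary, and it's untested:

```python
# Dumb-but-effective version of "dump the SERPs into whatever": append one
# JSON line per keyword per check, and let the analysis side read it back
# whenever. Untested sketch; the file name and field names are arbitrary.
import json
from datetime import datetime, timezone

def dump_serp(keyword, ranked_domains, path="serps.jsonl"):
    """Scraping side: append one raw snapshot (domains in rank order)."""
    record = {
        "keyword": keyword,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "results": [
            {"rank": i, "domain": domain}
            for i, domain in enumerate(ranked_domains, start=1)
        ],
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

def load_serps(path="serps.jsonl"):
    """Analysis side: read every snapshot back for whatever queries you want."""
    with open(path) as fh:
        return [json.loads(line) for line in fh]
```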
 

Pretty sure he was talking about &num=100
 

Didn't know that was possible / a viable scraping technique. Thanks for enlightening me tho


BLASPHEMY

I've got no problem with RoR. The hype has spawned a good bunch of frameworks; a bunch of them are half-baked, but many can compete with (or are even better than) RoR. But let's not get into the eternal debate. :)

@OP: just get to coding. When I first started with coding I read the Python tutorial for 30 minutes and then started hacking away. The code I produced was a total abomination, but I learned quickly. I still look at even pretty recent code of mine and go facepalm.jpg, but that's good; it means I never stop learning.