There's a DOM library for PHP that I really like. PHP really isn't a bad language; yes, it lacks threading.
I've replicated the same scrapers in PHP, Python, and Ruby in an effort to get a feel for their differences.
All three languages have great DOM libraries that work almost identically.
If you like one, you like them all.
In fact, I've built the website I spoke of in the OP in PHP (CodeIgniter), Python (Django), and I'm now working on the Ruby (Rails) version. They only have basic CRUD and user auth, and I have their scraper scripts running on practice timers (daily scrapes atm), simulating hundreds of websites with 20 keywords per website, to see which performs best. Originally I was going to compare the PHP and Python scrapers on a large-volume bidaily scrape, but once I had the Python scraper working as a backend routine, I just copied it over to my PHP website as well. Discussion in this thread got me curious, though, so I'll have to drop the PHP scraper back in and see how it fares.
I choose you, Rails.
But I'm mostly looking forward to working with Ruby on Rails. The new Rails 3 is great, and Ruby/Rails just fits my mental workings better than Python/Django did. I'm building the Rails version of this website as I go through Beginning Rails 3 by Apress. A quarter of the way through the book and I already feel confident I could build three-quarters of any website.
QUESTION
I doubt anyone will read down this far, but I'm stumped on one point: how should I store the keywords and the daily scrape information for each one?
Here's what I was thinking.
Code:
 _________          ___________________
| Users   |        | Websites          |
|---------|        |-------------------|
| id      |<--+    | id                |
|_________|   +--->| user_id           |
                   | url               |
                   | keywords_csv_path |
                   |___________________|
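In Rails terms, I'm picturing the migration for those two tables roughly like this (just a sketch; the column names are my working guesses, nothing final):
Code:
# Rough Rails 3 migration sketch of the two tables above.
# Column names are working guesses, not a final schema.
class CreateUsersAndWebsites < ActiveRecord::Migration
  def self.up
    create_table :users do |t|
      t.timestamps
    end

    create_table :websites do |t|
      t.integer :user_id             # FK back to users.id
      t.string  :url
      t.string  :keywords_csv_path   # where that website's CSV would live
      t.timestamps
    end
  end

  def self.down
    drop_table :websites
    drop_table :users
  end
end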
I couldn't think of a database solution that would scale short of creating a table for each website, and from what I've gathered, that's a lot of overhead. Then again, I'm probably overlooking the simplest solution.
But here's my idea: each website for each user would build its own CSV file that stores its keywords. I'm limiting keywords to 20 per website during development for ease. The scraper then adds a row for each date it scrapes, recording the rank position of each keyword (there's a rough Ruby sketch of that append step after the example below).
So the CSV would look like this:
Code:
[B]Date   [COLOR="PaleGreen"]Keyword1  Keyword2  Keyword3[/COLOR][/B]
12/13  NF        18
12/14  NF        17
12/15  99        17
12/16  98        14
12/17  98        13        7
12/18  96        13        6
12/19  95        12        7
12/20  95        9         5
...    ...       ...       ...
- NF indicates it was "not found"
- Keyword3 demonstrates a keyword added later
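Here's roughly the append step I'm picturing in the Ruby scraper. It's only a sketch with made-up names (the append_daily_ranks helper and the ranks hash format are my own invention), not tested code:
Code:
require 'csv'   # Ruby 1.9 stdlib; FasterCSV on 1.8

# Hypothetical helper: append one day's results to a website's CSV.
# `ranks` maps keyword => position, e.g. { "Keyword1" => 95, "Keyword2" => 9 };
# a tracked keyword that wasn't found gets written as "NF".
def append_daily_ranks(csv_path, date, ranks)
  rows   = File.exist?(csv_path) ? CSV.read(csv_path) : [["Date"]]
  header = rows.first

  # A keyword added later gets a new column; older rows are padded with blanks.
  (ranks.keys - header).each do |kw|
    header << kw
    rows[1..-1].each { |r| r << "" }
  end

  rows << [date] + header[1..-1].map { |kw| ranks.fetch(kw, "NF") }

  # Adding a column means rewriting the whole file -- part of why CSVs hurt here.
  CSV.open(csv_path, "w") { |csv| rows.each { |r| csv << r } }
end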
Any better ideas?
This gets pretty shitty if I want to, say, track Google, Yahoo, and Bing. Dealing with CSVs also sucks.