$500 To Whoever Can Do This The Quickest

Status
Not open for further replies.
Yeah, so what I did was pretty much a combination of scraping each website's internal links and using sitemaps.

I used Excel to concatenate "sitemap.xml" onto each source website, then used the SB (Scrapebox) sitemap scraper addon to pull the URLs out of those sitemaps. That gave me about half of the results.
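For anyone who wants to do this step without Excel and the addon, here is a minimal Python sketch of the same idea. The function names `sitemap_url` and `extract_locs` are made up for illustration, and it assumes the sitemap is a plain (uncompressed) XML file that you have already downloaded into a string.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def sitemap_url(site):
    """Build <site>/sitemap.xml, the same thing the Excel concatenation did."""
    if "://" not in site:
        site = "http://" + site  # assumption: bare domains are reachable over http
    return urljoin(site.rstrip("/") + "/", "sitemap.xml")


def extract_locs(xml_text):
    """Return the text of every <loc> element, whatever XML namespace is used."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
```

Usage would be something like `extract_locs(downloaded_text)` for each URL returned by `sitemap_url`, with the download itself left to whatever HTTP client you prefer.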

For the websites where domain.com/sitemap.xml didn't exist, I used the link extractor addon in SB to extract all the internal links. Then I loaded those links back into the link extractor and repeated the process over and over. That turns 100 links into 1,000, then into 15,000, and so on.
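The "reload and repeat" loop is basically a breadth-first crawl restricted to one host. Below is a rough Python sketch of it, not what the SB addon does internally. `fetch` is a hypothetical callable you supply that takes a URL and returns the page HTML as a string, and `rounds` is how many times you would have reloaded the list by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute links on the page that stay on the same host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url  # drop #fragments
        if urlparse(absolute).netloc == host:
            found.add(absolute)
    return found


def expand(seeds, fetch, rounds):
    """Feed newly found links back in, up to `rounds` times."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        next_frontier = set()
        for url in frontier:
            next_frontier |= internal_links(url, fetch(url)) - seen
        if not next_frontier:
            break  # nothing new turned up, so further rounds are pointless
        seen |= next_frontier
        frontier = next_frontier
    return seen
```

Only links not seen before go into the next round, so each page is fetched once even though the list keeps growing.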

For Yahoo I looked at their sitemap at http://finance.yahoo.com/sitemap.xml and found the link to their other sitemap: http://finance.yahoo.com/sitemap/stories/index.xml. If you look at that, all the stories are split across numbered sitemaps: http://finance.yahoo.com/sitemap/stories/1.xml, http://finance.yahoo.com/sitemap/stories/2.xml, and so on. So I just iterated through the numbers and ran each one through the SB sitemap extractor. That gave me a couple hundred thousand URLs.
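Iterating over the numbered sitemaps could be scripted along these lines. This is only a sketch: `walk_numbered` and the `{n}` placeholder are my own naming, `fetch` is a hypothetical callable that returns the sitemap body or `None` when the file does not exist, and stopping at the first missing number is an assumption about how the numbering behaves.

```python
def walk_numbered(template, fetch, start=1, limit=10000):
    """Yield (url, body) for 1.xml, 2.xml, ... until one is missing.

    `template` contains {n} where the number goes, e.g.
    "http://finance.yahoo.com/sitemap/stories/{n}.xml".
    `limit` is a safety cap so the loop cannot run forever.
    """
    for n in range(start, start + limit):
        url = template.format(n=n)
        body = fetch(url)
        if body is None:  # assumed: the first gap means we ran out of sitemaps
            break
        yield url, body
```

Each body it yields can then go through whatever sitemap parser you are using to pull out the story URLs.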

For MarketWatch I used the SB sitemap extractor to get the links from their main sitemap: http://articles.marketwatch.com/sitemap.xml. Those links point to their sub-sitemaps, so I ran each of those through the extractor as well. I ended up getting about 300K URLs from all of them.
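A main sitemap that only lists sub-sitemaps is a sitemap index, and following it can be done recursively. Here is a hedged Python sketch. `collect_urls` is a made-up name, `fetch` is again a hypothetical callable returning the XML text for a URL, and it assumes the index uses the standard `sitemapindex` root element.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, following sub-sitemaps if it is an index."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [
        el.text.strip()
        for el in root.iter()
        if local(el.tag) == "loc" and el.text
    ]
    if local(root.tag) == "sitemapindex":
        urls = []
        for child in locs:  # each <loc> here is another sitemap, not a page
            urls.extend(collect_urls(child, fetch, seen))
        return urls
    return locs  # ordinary <urlset>: the <loc> values are the page URLs
```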

All I ended up using was:
Scrapebox and Addons
Excel
TextPad
Laptop with i7 processor and 16GB of memory


PS: Thanks EricVorheese for the $500! I'm putting it on Red in Vegas this weekend
 


I also had to check for patterns, because some of the internal links I scraped had tracking codes on them. That caused SB to treat them as different links when in fact they were 5,000 links to the same page.
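One way to collapse those tracking variants is to strip the tracking parameters before comparing URLs. The sketch below is an illustration only: the parameter names in `TRACKING_PREFIXES` and `TRACKING_NAMES` are my assumptions (common analytics parameters), not the actual codes from these sites, so you would need to adjust them to whatever patterns you see in your own list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking parameters; edit to match your own data.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid", "ref", "src"}


def strip_tracking(url):
    """Return the URL without tracking query parameters or #fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Running every scraped link through `strip_tracking` before deduplicating means the tracked copies all reduce to the same string.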

I also used alivecheck to make sure there were no dead URLs, and DupRemove to deal with the file that had 1M+ entries.
 