Yeah, so pretty much what I did was a combination of scraping each website's internal links and using sitemaps.
I used Excel to append "/sitemap.xml" to each source website, then used the Scrapebox (SB) sitemap scraper addon to pull the URLs from those sitemaps. That gave me about half of the results.
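If you don't have SB handy, the same step can be roughed out in Python. This is only a sketch of the idea, not what the addon does internally: the function names are made up for illustration, bare domains are assumed to be plain http, and it only reads the <loc> entries of an uncompressed sitemap.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen


def sitemap_url(site):
    """Build the conventional /sitemap.xml location for a site."""
    # Assumption: a bare domain such as "example.com" is served over http.
    if "://" not in site:
        site = "http://" + site
    # A leading slash makes urljoin resolve against the domain root.
    return urljoin(site, "/sitemap.xml")


def extract_locs(xml_text):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def pull_sitemap(site):
    """Fetch a site's sitemap and list its URLs; empty list if unavailable."""
    try:
        with urlopen(sitemap_url(site), timeout=10) as resp:
            return extract_locs(resp.read().decode("utf-8", errors="replace"))
    except (OSError, ET.ParseError):
        # Missing sitemap, network error, or a non-XML response.
        return []
```

Calling pull_sitemap on each source site and keeping the ones that return an empty list gives you the leftover sites that need the link extractor approach instead.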
For the websites where domain.com/sitemap.xml didn't exist, I used the link extractor addon in SB to extract all the internal links. Then I loaded those links back into the link extractor and repeated, over and over. That would turn 100 links into 1,000, then 15,000, and so on.
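The repeat-and-reload loop is basically a breadth-first crawl limited to links on the same host. Here is a minimal sketch of that idea in Python, assuming "internal" means "same hostname as the page it was found on" (the addon may define it differently). The names internal_links and expand are hypothetical, and fetch_page can be swapped out, which is how you would test it without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute, fragment-free links that stay on page_url's host."""
    host = urlparse(page_url).netloc
    parser = _LinkParser()
    parser.feed(html)
    found = set()
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).netloc == host:
            found.add(absolute)
    return found


def fetch(url):
    """Download a page as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def expand(seeds, rounds, fetch_page=fetch):
    """Re-extract links from each round's new URLs, for a fixed number of rounds."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        new_links = set()
        for url in frontier:
            try:
                new_links |= internal_links(url, fetch_page(url)) - seen
            except OSError:
                continue  # skip pages that fail to load
        if not new_links:
            break
        seen |= new_links
        frontier = new_links
    return seen
```

Each round only fetches the URLs discovered in the previous round, so the list grows the same way as reloading the extractor's output back into itself, without re-downloading pages you already processed.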
For Yahoo, I looked at their sitemap at http://finance.yahoo.com/sitemap.xml and found the link to their other sitemap: http://finance.yahoo.com/sitemap/stories/index.xml. If you look at that, all the stories are saved into different numbered sitemaps: http://finance.yahoo.com/sitemap/stories/1.xml, http://finance.yahoo.com/sitemap/stories/2.xml, and so on. So I just kept incrementing the number and ran the resulting list through the SB sitemap extractor. That gave me a couple hundred thousand URLs.
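Generating the numbered sitemap list is a one-liner in Python if you'd rather not drag-fill it in Excel. A minimal sketch, assuming the files are simply numbered 1.xml, 2.xml, and so on under one base path. The function name and the upper bound of 3 are placeholders; the real last number has to come from the index file.

```python
def numbered_sitemaps(base, first, last):
    """Return base/first.xml through base/last.xml, inclusive."""
    base = base.rstrip("/")
    return ["{}/{}.xml".format(base, n) for n in range(first, last + 1)]


# The upper bound 3 is only an example value, not Yahoo's real count.
urls = numbered_sitemaps("http://finance.yahoo.com/sitemap/stories", 1, 3)
print("\n".join(urls))
```

Save the printed list to a text file and it can be loaded straight into a sitemap extractor.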
For MarketWatch I used the SB sitemap extractor to get the links from their main sitemap: http://articles.marketwatch.com/sitemap.xml. Those links are their sub-sitemaps. I ended up getting around 300K URLs from all of them.
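A main sitemap that only links to sub-sitemaps is a sitemap index, and walking it can be sketched in Python as below. This assumes the standard sitemaps.org layout (<sitemap><loc> for sub-sitemaps, <url><loc> for pages) and uncompressed XML, so .xml.gz files would need extra handling. split_sitemap and collect_urls are hypothetical names, and fetch_xml is replaceable for testing.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def _local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def split_sitemap(xml_text):
    """Return (sub_sitemap_urls, page_urls) found in one sitemap document."""
    root = ET.fromstring(xml_text)
    subs, pages = [], []
    for entry in root:
        target = subs if _local(entry.tag) == "sitemap" else pages
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                target.append(child.text.strip())
    return subs, pages


def fetch(url):
    """Download a sitemap as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_urls(sitemap_url, fetch_xml=fetch, seen=None):
    """Follow sub-sitemaps recursively and return every page URL found."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []  # guard against sitemaps that reference each other
    seen.add(sitemap_url)
    subs, pages = split_sitemap(fetch_xml(sitemap_url))
    for sub in subs:
        pages.extend(collect_urls(sub, fetch_xml, seen))
    return pages
```

Pointing collect_urls at a main sitemap URL returns the page URLs from all of its sub-sitemaps in one list, which is the same result as feeding the sub-sitemap links back through the extractor by hand.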
All I ended up using was:
Scrapebox and Addons
Excel
TextPad
Laptop with i7 processor and 16GB of memory
PS: Thanks EricVorheese for the $500! I'm putting it on Red in Vegas this weekend