$500 To Whoever Can Do This The Quickest

Status
Not open for further replies.

EricVorheese

UPDATE: $500, NOT $100

I need URLs of articles/blog posts from specific niches, saved to a plain-text file. I will supply the list of source sites for each niche; be sure to mix up the frequency of URLs extracted per site.

1. 1 million URLs of articles/blog posts from sites in the finance niche
2. 66k URLs of articles/blog posts from sites in the automotive niche
3. 66k URLs of articles/blog posts from sites in the food niche
4. 66k URLs of articles/blog posts from sites in the travel niche
5. 66k URLs of articles/blog posts from sites in the entertainment niche
6. 66k URLs of articles/blog posts from sites in the health niche
7. 66k URLs of articles/blog posts from sites in the sports niche
8. 66k URLs of articles/blog posts from sites in the tech niche

Whoever can do this the fastest will get $500 from me via PayPal.

Deadline: 11:59 PM EST, Feb 6 2013. PM me if you think you can do it but need more time.

PM me if you're tackling this, just to give me a heads up.

Thanks!
E
 


Sounds like fun. I'll give this a try.
Awesome. Keep me posted.

Can't you just do this with Scrapebox?
1. Use the Keyword Scraper to scrape KWs related to main KW
2. Click "Harvest URLS"
I need the URLs of the articles/blog posts from specific sites. Can Scrapebox help here?

Do you supply the proxies?
No, but I can supply the source sites list; that should save some time. (Updated the OP to reflect this.)


To anyone else considering this, I can supply the source sites list. Let me know.
 
After some discussion, I am tapping out. I started by writing this in PHP (strike #1) and reading RSS feeds (strike #2). Fun game, but not a proper solution.

My source can be downloaded at:
urlextract.zip

It contains three items:
1. MySQL database structure. Tables for Niches, Feeds (belongs to Niche), Links (belongs to Feed)
2. "extract.php" - which reads all feeds, grabs the links, and stores them.
3. "cleanfeedburner.php" - specifically designed for FeedBurner RSS feeds which do not contain the destination URL. This script hits every FeedBurner link, reads where it 301s to, and updates the links we already stored.

I was planning to insert a bunch of feeds into the db from a site like this and then run the script.
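
For the record, the extract.php described above boils down to something like this. It's only a bare-bones sketch of the idea, not the zipped source; the DSN and the feeds/links table layout are placeholders I made up:

Code:
<?php
// Sketch of the extract.php idea: loop over every stored feed, pull the
// <link> of each RSS item, and save it against that feed. The PDO DSN and
// the feeds/links schema below are made-up placeholders, not the real one.
$pdo = new PDO('mysql:host=localhost;dbname=urlextract', 'user', 'pass');

$feeds  = $pdo->query('SELECT id, url FROM feeds')->fetchAll(PDO::FETCH_ASSOC);
$insert = $pdo->prepare('INSERT IGNORE INTO links (feed_id, url) VALUES (?, ?)');

foreach ($feeds as $feed) {
    $xml = @simplexml_load_file($feed['url']); // RSS 2.0 only; Atom needs extra handling
    if ($xml === false || !isset($xml->channel->item)) {
        continue; // dead, blocked, or malformed feed - skip it
    }
    foreach ($xml->channel->item as $item) {
        $link = trim((string) $item->link);
        if ($link !== '') {
            $insert->execute([$feed['id'], $link]); // assumes a UNIQUE key on (feed_id, url)
        }
    }
}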
 
Thanks for trying, appreciate it.
 
Just chiming in here that this is going to be really difficult to verify. Even a properly built-out scraper can get thrown off by correct-niche sites with unrelated keywords in the body/URL/title.

Would love to know what you're feeding this data to :)
 
Hey shawn,

Actually, we're not going after keywords. We're going after specific sites and the blog posts/articles on those sites. For example, with TechCrunch, we don't have to wonder whether the scraped post URLs are going to relate to tech; we know they will.
 
(attached screenshot: 9PG5MGk.png)


Example output:

Code:
http://www.mediatek.com/_en/index.php
http://www.nytimes.com/2013/01/07/technology/07iht-mediatek07.html?_r=0
http://www.chinapost.com.tw/taiwan/business/2013/02/05/369630/MediaTek-Inc.htm
http://www.nytimes.com/2013/01/07/technology/07iht-mediatek07.html?_r=0
http://www.isuppli.com/Mobile-and-Wireless-Communications/MarketWatch/Pages/Smartphones-to-Take-Over-Lead-in-Cellphone-Market-Faster-Than-Expected.aspx
http://www.wired.com/gadgetlab/2013/02/mediatek/
http://www.unwiredview.com/2013/02/05/mediatek-plans-to-ship-200-million-smartphone-chipsets-this-year-dedicated-tablet-cpu-coming-in-q3/
http://www.engadget.com/2013/01/23/china-mobile-td-scdma-2013/
http://www.qualcomm.com/media/releases/2013/01/21/qualcomm-reference-design-program-achieves-growth-device-launches-bringing
http://www.unwiredview.com/2013/01/24/to-expand-in-china-qualcomm-promises-white-box-smartphone-makers-help-going-global
 
Looks neat. The only problem is that most sites only show the latest 10-30 posts in their RSS feed, so it's virtually impossible to get 1 million URLs from 500 sites.

The only way to do this is to scrape. There's a member working on this as we speak. I'll keep you posted.
 
You know you can iterate through the pages, right?

/feed/?paged=1
/feed/?paged=2
/feed/?paged=3
/feed/?paged=4
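
For a stock WordPress feed, that loop is only a few lines of PHP. A rough sketch (example.com is a stand-in; WordPress 404s once you page past the last post):

Code:
<?php
// Sketch: walk a WordPress feed page by page (/feed/?paged=N) and collect
// every item link until a page 404s or comes back empty.
$base = 'http://example.com/feed/'; // placeholder site
$urls = [];

for ($page = 1; ; $page++) {
    $xml = @simplexml_load_file($base . '?paged=' . $page);
    if ($xml === false || !isset($xml->channel->item)) {
        break; // past the last page (WordPress returns a 404 there)
    }
    foreach ($xml->channel->item as $item) {
        $urls[] = (string) $item->link;
    }
}

file_put_contents('urls.txt', implode(PHP_EOL, array_unique($urls)) . PHP_EOL);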
 
Yeah, but for some reason the iteration breaks for a large chunk of the FeedBurner feeds. It does work for plain XML feeds, though, like the TC feed.
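
Separate from the pagination problem: the cleanfeedburner.php step described earlier, following each FeedBurner link's 301 to the real destination, comes down to something like this (rough sketch; the function name is made up):

Code:
<?php
// Sketch of the cleanfeedburner.php idea: FeedBurner item links don't carry
// the destination URL, so request each one, follow the redirect chain, and
// keep the final URL instead of the feedproxy one.
function resolve_redirect($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // headers only, we don't need the body
        CURLOPT_FOLLOWLOCATION => true,  // chase the 301/302 chain
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // wherever we ended up
    curl_close($ch);
    return $final ?: $url; // fall back to the original link on failure
}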
 
Not tackling this myself. I conceded defeat to the RSS beast many moons ago. But I am following this thread; interested in the mechanics of the solution.
 
Skank wins the challenge! He used Scrapebox. I'm sure he'll tell us all how he did it. (My guess: doing a Google search for "site:domain.com" and scraping the results.)

Waiting for his PayPal ID so I can send him his money :)

EDIT: Payment sent
 
site:domain usually only returns 1,000 results max. Could he have just spidered the site or grabbed a sitemap from it?
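
The sitemap route would look roughly like this: read /sitemap.xml, and if it's a sitemap index, recurse into each child sitemap and collect every <loc>. Just a sketch, with a placeholder domain:

Code:
<?php
// Sketch: harvest post URLs from a site's XML sitemap. Handles both a plain
// <urlset> and a <sitemapindex> that points at child sitemaps.
function sitemap_urls($sitemapUrl) {
    $xml = @simplexml_load_file($sitemapUrl);
    if ($xml === false) {
        return [];
    }
    $xml->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    $urls = [];
    foreach ($xml->xpath('//sm:sitemap/sm:loc') as $loc) {
        $urls = array_merge($urls, sitemap_urls((string) $loc)); // recurse into child sitemaps
    }
    foreach ($xml->xpath('//sm:url/sm:loc') as $loc) {
        $urls[] = (string) $loc; // actual page/post URLs
    }
    return $urls;
}

// Placeholder domain - point it at a real sitemap to try it:
// print_r(sitemap_urls('http://example.com/sitemap.xml'));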
 