$500 To Whoever Can Do This The Quickest

Status
Not open for further replies.

EricVorheese

UPDATE: $500, NOT $100

I need URLs of articles/blog posts from specific niches, saved to a plain-text file. I will supply the list of source sites for each niche; be sure to mix up the frequency of URLs extracted per site.

1. 1 million URLs of articles/blog posts from sites in the finance niche
2. 66k URLs of articles/blog posts from sites in the automotive niche
3. 66k URLs of articles/blog posts from sites in the food niche
4. 66k URLs of articles/blog posts from sites in the travel niche
5. 66k URLs of articles/blog posts from sites in the entertainment niche
6. 66k URLs of articles/blog posts from sites in the health niche
7. 66k URLs of articles/blog posts from sites in the sports niche
8. 66k URLs of articles/blog posts from sites in the tech niche

Whoever can do this the fastest will get $500 from me via PayPal.

Deadline: 11:59 PM EST, Feb 6 2013. PM me if you think you can do it but need more time.

PM me if you're tackling this, just to give me a heads up.

Thanks!
E
 


Sounds like fun. I'll give this a try.
Awesome. Keep me posted.

Can't you just do this with Scrapebox?
1. Use the Keyword Scraper to scrape KWs related to main KW
2. Click "Harvest URLS"
I need the URLs of the articles/blog posts from specific sites. Can Scrapebox help here?

Do you supply the proxies?
No, but I can supply the source sites list; that should save some time. (Updated the OP to reflect this.)


To anyone else considering this, I can supply the source sites list. Let me know.
 
After some discussion, I am tapping out. I started by writing this in PHP (strike #1) and reading RSS feeds (strike #2). Fun game, but not a proper solution.

My source can be downloaded at:
urlextract.zip

It contains three items:
1. MySQL database structure. Tables for Niches, Feeds (belongs to Niche), Links (belongs to Feed)
2. "extract.php" - which reads all feeds, grabs the links, and stores them.
3. "cleanfeedburner.php" - specifically designed for FeedBurner RSS feeds which do not contain the destination URL. This script hits every FeedBurner link, reads where it 301s to, and updates the links we already stored.

I was planning to insert a bunch of feeds into the db from a site like this and then run the script.
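
For the record, the extract.php described above boils down to something like this. It's only a bare-bones sketch of the idea, not the zipped source; the DSN and the feeds/links table layout are placeholders I made up:

Code:
<?php
// Sketch of the extract.php idea: loop over every stored feed, pull the
// <link> of each RSS item, and save it against that feed. The PDO DSN and
// the feeds/links schema below are made-up placeholders, not the real one.
$pdo = new PDO('mysql:host=localhost;dbname=urlextract', 'user', 'pass');

$feeds  = $pdo->query('SELECT id, url FROM feeds')->fetchAll(PDO::FETCH_ASSOC);
$insert = $pdo->prepare('INSERT IGNORE INTO links (feed_id, url) VALUES (?, ?)');

foreach ($feeds as $feed) {
    $xml = @simplexml_load_file($feed['url']); // RSS 2.0 only; Atom needs extra handling
    if ($xml === false || !isset($xml->channel->item)) {
        continue; // dead, blocked, or malformed feed - skip it
    }
    foreach ($xml->channel->item as $item) {
        $link = trim((string) $item->link);
        if ($link !== '') {
            $insert->execute([$feed['id'], $link]); // assumes a UNIQUE key on (feed_id, url)
        }
    }
}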
 
Thanks for trying, appreciate it.
 
Just chiming in here that this is going to be really difficult to verify. Even a properly built-out scraper can get thrown off by correct-niche sites with unrelated keywords in the body/URL/title.

Would love to know what you're feeding this data to :)
 
Hey shawn,

Actually, we're not going after keywords. We're going after specific sites and the blog posts/articles on those sites. For example, with TechCrunch, we don't have to wonder whether the scraped post URLs are going to relate to tech; we know they will.
 
(attached screenshot: 9PG5MGk.png)


Example output:

Code:
http://www.mediatek.com/_en/index.php
http://www.nytimes.com/2013/01/07/technology/07iht-mediatek07.html?_r=0
http://www.chinapost.com.tw/taiwan/business/2013/02/05/369630/MediaTek-Inc.htm
http://www.nytimes.com/2013/01/07/technology/07iht-mediatek07.html?_r=0
http://www.isuppli.com/Mobile-and-Wireless-Communications/MarketWatch/Pages/Smartphones-to-Take-Over-Lead-in-Cellphone-Market-Faster-Than-Expected.aspx
http://www.wired.com/gadgetlab/2013/02/mediatek/
http://www.unwiredview.com/2013/02/05/mediatek-plans-to-ship-200-million-smartphone-chipsets-this-year-dedicated-tablet-cpu-coming-in-q3/
http://www.engadget.com/2013/01/23/china-mobile-td-scdma-2013/
http://www.qualcomm.com/media/releases/2013/01/21/qualcomm-reference-design-program-achieves-growth-device-launches-bringing
http://www.unwiredview.com/2013/01/24/to-expand-in-china-qualcomm-promises-white-box-smartphone-makers-help-going-global
 
Looks neat. The only problem is that most sites only show the latest 10-30 posts in their RSS feed, so it's virtually impossible to get 1 million URLs from 500 sites.

The only way to do this is to scrape. There's a member working on this as we speak. I'll keep you posted.
 
You know you can iterate through the pages, right?

/feed/?paged=1
/feed/?paged=2
/feed/?paged=3
/feed/?paged=4
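
For a stock WordPress feed, that loop is only a few lines of PHP. A rough sketch (example.com is a stand-in; WordPress 404s once you page past the last post):

Code:
<?php
// Sketch: walk a WordPress feed page by page (/feed/?paged=N) and collect
// every item link until a page 404s or comes back empty.
$base = 'http://example.com/feed/'; // placeholder site
$urls = [];

for ($page = 1; ; $page++) {
    $xml = @simplexml_load_file($base . '?paged=' . $page);
    if ($xml === false || !isset($xml->channel->item)) {
        break; // past the last page (WordPress returns a 404 there)
    }
    foreach ($xml->channel->item as $item) {
        $urls[] = (string) $item->link;
    }
}

file_put_contents('urls.txt', implode(PHP_EOL, array_unique($urls)) . PHP_EOL);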
 
Yeah, but for some reason the iteration breaks for a large chunk of the FeedBurner feeds. It does work for plain XML feeds, though, like the TC feed.
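
Separate from the pagination problem: the cleanfeedburner.php step described earlier, following each FeedBurner link's 301 to the real destination, comes down to something like this (rough sketch; the function name is made up):

Code:
<?php
// Sketch of the cleanfeedburner.php idea: FeedBurner item links don't carry
// the destination URL, so request each one, follow the redirect chain, and
// keep the final URL instead of the feedproxy one.
function resolve_redirect($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // headers only, we don't need the body
        CURLOPT_FOLLOWLOCATION => true,  // chase the 301/302 chain
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // wherever we ended up
    curl_close($ch);
    return $final ?: $url; // fall back to the original link on failure
}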
 
Not tackling this myself. I conceded defeat to the RSS beast many moons ago. But I am following this thread; interested in the mechanics of the solution.
 
Skank wins the challenge! He used Scrapebox. I'm sure he'll tell us all how he did it. (My guess: doing a Google search for "site:domain.com" and scraping the results.)

Waiting for his PayPal ID so I can send him his money :)

EDIT: Payment sent
 
site:domain usually only returns 1,000 results max. Could he have just spidered the site or grabbed a sitemap from it?
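
The sitemap route would look roughly like this: read /sitemap.xml, and if it's a sitemap index, recurse into each child sitemap and collect every <loc>. Just a sketch, with a placeholder domain:

Code:
<?php
// Sketch: harvest post URLs from a site's XML sitemap. Handles both a plain
// <urlset> and a <sitemapindex> that points at child sitemaps.
function sitemap_urls($sitemapUrl) {
    $xml = @simplexml_load_file($sitemapUrl);
    if ($xml === false) {
        return [];
    }
    $xml->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    $urls = [];
    foreach ($xml->xpath('//sm:sitemap/sm:loc') as $loc) {
        $urls = array_merge($urls, sitemap_urls((string) $loc)); // recurse into child sitemaps
    }
    foreach ($xml->xpath('//sm:url/sm:loc') as $loc) {
        $urls[] = (string) $loc; // actual page/post URLs
    }
    return $urls;
}

// Placeholder domain - point it at a real sitemap to try it:
// print_r(sitemap_urls('http://example.com/sitemap.xml'));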
 