Do people still want databases?



Those of you who want a Yelp scraper (honestly I'm kind of interested in this too)...how do you think I should set it up?

I'm thinking you give it your query (or queries) and it outputs a .csv. For every attribute it finds, it makes a new column for that attribute and fills it in whenever it can. That way you can sort by attributes, etc. It would also grab the Yelp business id (i.e. the slug from the URL of a business's page) and save a bunch of text files, each one named for a specific business, containing the reviews.
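A rough sketch of that output layout, in case it helps the discussion. Everything here is an assumption for illustration: the input shape (a list of dicts with `url` and `attributes`), the function names, and the sample URLs are all made up, and the scraping itself is left out.

```python
import csv
import io
from urllib.parse import urlparse


def business_slug(url):
    """Return the last path segment of a business page URL (assumed to be the id)."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def review_filename(url):
    """Name of the per-business text file that would hold its reviews."""
    return business_slug(url) + ".txt"


def write_businesses_csv(businesses, out):
    """Write one row per business, with one column per attribute seen anywhere."""
    # Build the union of attribute names, keeping first-seen order.
    columns = ["slug"]
    for biz in businesses:
        for key in biz["attributes"]:
            if key not in columns:
                columns.append(key)
    # restval="" leaves the cell empty when a business lacks an attribute.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for biz in businesses:
        row = {"slug": business_slug(biz["url"])}
        row.update(biz["attributes"])
        writer.writerow(row)
    return columns


# Hypothetical sample data, not real scraped results.
businesses = [
    {"url": "https://www.yelp.com/biz/joes-salon-austin",
     "attributes": {"Accepts Credit Cards": "Yes"}},
    {"url": "https://www.yelp.com/biz/cut-above-dallas/",
     "attributes": {"Parking": "Street", "Accepts Credit Cards": "No"}},
]
buf = io.StringIO()
columns = write_businesses_csv(businesses, buf)
print(buf.getvalue())
```

Collecting the column names in a first pass means the header is complete before any row is written, so the .csv stays sortable by any attribute.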

Thoughts?

Oh, also, does anyone know if I am going to hit any kind of query or page limit? I know the API has one built in, but I don't know if their actual site checks.
 
It is really hard to stay out of this subject but since this is ShS I hope you don't take offense.

I already have a simple Yelp scraper on the market in the BST as LocalScraper (LocalScraper.com), which scrapes simple data from 5 different sources. It's gotten 0 attention and has sold one copy, despite the insane drop in price I gave it to get some business. My own fault for marketing it like shit.

I have also had a few people contact me through its webpage about a more advanced Yelp scraper, and I have already completed one that grabs everything but the reviews and images and outputs it to csv.

The first one I made took a search string like "Salons" and a location and then output all the businesses from the search. The new one I have made (releasing in a day or two) goes by category or subcategory and scrapes the entire thing rather than going by keyword. I have found that Yelp matches search terms against review text too, so if you search for "salon" you can get unrelated businesses just because a review had the word "salon" in it.

In developing both of these bots I have never hit a Yelp wall with regard to page count. I have done 300 pages in a run on test projects without a problem. But I also have enough random delays in place to mimic a real user, and proxy support if that fails.

Again, sorry for shitting on your thread here. But it's hard to see people looking for a product when you already have one for sale on the site and are about to release one that people are talking about.
 
Golf course DB is done. It's sweet. 19,000 courses. Reviews for all of them when they were available (50,000 reviews). Ratings, etc. Also includes a junction table containing data for how far each course is from nearby cities (so you can search for all courses within X miles of Y city). Right now I'm just putting together a readme.txt with some common queries (some are kind of difficult with the junction table).
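For anyone wondering what a "courses within X miles of Y city" query against a junction table could look like, here is a minimal sketch. The table and column names (`courses`, `cities`, `course_city_distance`, `miles`) are guesses, not the actual schema of this database, and the rows are invented placeholders.

```python
import sqlite3

# In-memory database with an assumed three-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses (course_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cities (city_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course_city_distance (
    course_id INTEGER REFERENCES courses(course_id),
    city_id INTEGER REFERENCES cities(city_id),
    miles REAL,
    PRIMARY KEY (course_id, city_id)
);
""")

# Placeholder rows so the query has something to return.
conn.executemany("INSERT INTO courses VALUES (?, ?)",
                 [(1, "Course A"), (2, "Course B"), (3, "Course C")])
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [(1, "Springfield"), (2, "Shelbyville")])
conn.executemany("INSERT INTO course_city_distance VALUES (?, ?, ?)",
                 [(1, 1, 3.5), (2, 1, 9.0), (3, 1, 42.0), (3, 2, 4.0)])


def courses_near(conn, city_name, max_miles):
    """Return (course name, miles) pairs within max_miles of the city, nearest first."""
    return conn.execute(
        """
        SELECT c.name, d.miles
        FROM course_city_distance AS d
        JOIN courses AS c ON c.course_id = d.course_id
        JOIN cities AS ci ON ci.city_id = d.city_id
        WHERE ci.name = ? AND d.miles <= ?
        ORDER BY d.miles
        """,
        (city_name, max_miles),
    ).fetchall()


print(courses_near(conn, "Springfield", 10))
```

The junction table holds one row per (course, city) pair, so the distance filter is a plain comparison and no distance math is needed at query time.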

PM me if you are going to want it. Price will likely be $39.
 
I'll take custom orders if the price is right - otherwise I'd be better off just making one for everybody.

BTW, golf database is up for sale, check link in signature.
 
yelp scraper

I also wrote a Perl script that gets the attributes as well as the reviews (in a separate text file). I keep getting blocked by Yelp though. Does anyone have experience using proxies to get around that? Any recommendations?
 
Random delays between requests, randomise the order you hit the URLs, proxies are a good idea, and realistic referrers never hurt.
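Those suggestions could be combined into a request plan along these lines. This is only a sketch: `build_fetch_plan`, the delay bounds, the default referrer, and the proxy addresses are all assumptions for illustration, and nothing here actually sends a request.

```python
import itertools
import random


def build_fetch_plan(urls, proxies, min_delay=2.0, max_delay=8.0,
                     referrer="https://www.yelp.com/", rng=None):
    """Return a list of steps: shuffled URL, random delay, rotating proxy, headers."""
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)  # randomise the order the URLs are hit
    # Rotate through the proxy list; fall back to no proxy if the list is empty.
    proxy_cycle = itertools.cycle(proxies) if proxies else itertools.repeat(None)
    plan = []
    for url in shuffled:
        plan.append({
            "url": url,
            "delay": rng.uniform(min_delay, max_delay),  # seconds to wait first
            "proxy": next(proxy_cycle),
            "headers": {"Referer": referrer},
        })
    return plan


# Hypothetical inputs; the proxy hosts are placeholders.
urls = ["https://www.yelp.com/biz/example-%d" % i for i in range(5)]
proxies = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]
for step in build_fetch_plan(urls, proxies):
    print(step["delay"], step["proxy"], step["url"])
```

Whatever does the fetching would sleep for `delay` seconds before each step. Keeping the plan separate from the network code makes it easy to check the pacing before running anything.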
 
PMing you - Let's talk business. I need some scraping done within a few months so maybe we can work something out. It will be a custom job and I'm going to want source code so that I can continue to scrape. Was also gonna buzz MattSeh when it all goes down.
 
yelp scraper

Also, don't run 30,000 hits in a single hour. My IP is banned, haha.

My IPs keep getting banned and I keep changing them :)
I'm happy to share my script (Perl) for Yelp scraping if someone can help add proxy support with a list of proxies, plus other tricks to get around being banned. Yelp is by far the strictest site I've ever scraped.