Freebie ScrapCL: Craigslist Email Scraper

Status
Not open for further replies.

kblessinggr

Sep 15, 2008
I spent the last couple of hours learning Python. The result? An email scraper for Craigslist.

Usage (from the command line, Run -> cmd)
scrapcl.exe [starting url] [category] [filename]

example:
scrapcl.exe http://grandrapids.craigslist.org/sys/ sys emails

With the above example, results are saved to emails.txt in the same folder, one email per line.

Basically, it follows through and grabs each page until there are none left. If the provided file name already exists, it appends to the end (so if you want to go city by city to build one single list, you can). It's only designed to follow links within the category folder, or index####.html for each page of postings.
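The append behavior described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual source: `append_lines` and `demo` are hypothetical names, and the entries written are placeholders.

```python
import os
import tempfile


def append_lines(path, lines):
    """Append each entry on its own line; the file is created if missing."""
    # Mode "a" never truncates, so an existing list is extended, not replaced.
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def demo():
    """Simulate two runs (say, two cities) writing to the same file."""
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "emails.txt")
    append_lines(path, ["first-run-entry"])
    append_lines(path, ["second-run-entry"])
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

Calling `demo()` shows the second run's entry landing after the first run's, which is what makes the city-by-city workflow possible.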

It can take several minutes to do a single city/category, depending on your internet speed and computer.

Download: http://www.karlblessing.com/dist/scrapcl.zip
 


Updated Version (0.1.2)

Fixed a couple things and made it faster.

http://www.karlblessing.com/dist/scrapcl012.zip

Usage:
Usage: scrapcl start_url [save_file_name]

start_url must be a craigslist url that includes a category,
results are restricted to this category

http://city.craigslist.org/category/

save_file_name is an optional parameter, if no name is provided
it defaults to emails.

scrapcl http://city.craigslist.org/sys/ syse
will save a file called syse.txt with the emails (one per line)

I simply made it so you can type scrapcl http://city.craigslist.org/cat/ and it'll save into emails.txt by default (otherwise it saves into the name you give it).
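For anyone curious how an optional file name with a default might be handled, here is a minimal Python sketch. It is not the tool's real code: `parse_args` and `DEFAULT_NAME` are made-up names, and the only behavior taken from the post is "one or two arguments, default name emails, .txt added to the name".

```python
DEFAULT_NAME = "emails"  # default file name described in the post


def parse_args(argv):
    """Return (start_url, output_file) from the arguments after the program name."""
    if not 1 <= len(argv) <= 2:
        raise ValueError("Usage: scrapcl start_url [save_file_name]")
    start_url = argv[0]
    # Fall back to the default name when no second argument is given.
    name = argv[1] if len(argv) == 2 else DEFAULT_NAME
    return start_url, name + ".txt"
```

In a real script this would be called as `parse_args(sys.argv[1:])`, so `scrapcl http://city.craigslist.org/sys/ syse` would map to the output file `syse.txt`.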

What was fixed:
- Not all postings sit directly under /category/. Seattle, for example, has a lot under /location/category/, so I fixed the code to look for that.
- Instead of recursively going deeper and deeper, it now follows Craigslist's index### structure and stops when it hits a page with fewer than 100 postings.
- The scraper no longer tries to follow links once it has reached a posting page with the email on it.
- It stops at index50000.html (50,000).
- It can be interrupted with Ctrl+C.
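The first, second, and fourth fixes above could look roughly like this in Python. Treat it as a sketch under assumptions, not the actual implementation: the function names are invented, posting links are assumed to end in a numeric `.html` file, index pages are assumed to step by 100 (index100.html, index200.html, ...), and `count_postings` is a stand-in the caller supplies, so the sketch itself does no network access.

```python
import re
from urllib.parse import urlparse

PAGE_SIZE = 100      # a page with fewer postings is treated as the last one
LAST_OFFSET = 50000  # hard stop at index50000.html


def category_of(start_url):
    """Return the last path segment, e.g. 'sys' for http://city.craigslist.org/sys/."""
    path = urlparse(start_url).path.strip("/")
    if not path:
        raise ValueError("start_url needs a category, e.g. /sys/")
    return path.split("/")[-1]


def is_posting_path(path, category):
    """True for /category/123.html and /location/category/123.html (assumed shapes)."""
    pattern = rf"^/(?:[a-z]+/)?{re.escape(category)}/\d+\.html$"
    return re.match(pattern, path) is not None


def index_urls(start_url):
    """Yield the start page, then index100.html, index200.html, ... in order."""
    root = start_url.rstrip("/") + "/"
    yield root
    for offset in range(PAGE_SIZE, LAST_OFFSET + 1, PAGE_SIZE):
        yield f"{root}index{offset}.html"


def crawl_indexes(start_url, count_postings):
    """Walk index pages in a flat loop, stopping at the first short page."""
    visited = []
    for url in index_urls(start_url):
        visited.append(url)
        if count_postings(url) < PAGE_SIZE:
            break
    return visited
```

The point of the flat loop is that the number of requests is bounded by the index pages rather than by however deep the links happen to nest.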

Coming soon:
- Turn off saving of craigslist forwarders
- Be able to tell it just how many emails you'd like back (only want the first 1,000?)
- Grab other things like the "Location: " and posted date.

(For those confused: go to Start, then Run, open cmd, then type "cd c:\path\to\scrapcl" and use the command above.)
 
That's a nifty little helper, thanks a lot for sharing this. I'm sure I'll have some use for it.
 
Thanks a lot for this, man... really cool of you. I'm pretty green when it comes to command lines. I typed in "scrapcl http://city.craiglist.org/category syse" and got

"scrapcl is not recognized as an internal or external command, operable program or batch file"... any pointers for me? Thanks again for this.

You need to run the command from the directory where the executable is located (cd into that folder first).

Thanks for the tool!
+Rep.
 