How to Scrape Data from PDFs, Flash Sites (and HTML)



Fortunately, there are solid scraping libraries for most web programming languages (many of which are essentially curl & XPath wrappers). Python's equivalent of Nokogiri is the lovely Beautiful Soup. PHP users can work directly with curl & XPath.

Instead of using Firebug to get the HTML/CSS footprint of the data that you want, though, I recommend the more useful Selector Gadget. It's not even a browser add-on, just a bookmarklet that you drag into your bookmarks bar.

What makes it so powerful is that you can click the elements on the website that you want to scrape, and it generates the CSS/XPath selectors, which you can copy and paste into whatever scraping engine you use. Even more useful is the ability to deselect the elements you don't want to include; it adjusts the generated selector accordingly.

Until I started using Selector Gadget, I'd find myself having to go back and refine my HTML element selectors, because while <span class="stuff"> might contain what I need, there would also be a <span class="stuff footer"> I hadn't noticed that was getting scraped too. Selector Gadget shows you that.
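
To make the "stuff" vs. "stuff footer" problem concrete, here's a rough sketch. It uses only Python's standard-library html.parser rather than Beautiful Soup, and the sample markup plus the names SpanTextCollector and scrape are made up for illustration. The idea is roughly what a CSS selector like span.stuff:not(.footer) expresses: keep spans with the class you want, and drop the ones that also carry a class you don't.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect text from <span> tags that have `wanted` but none of the `unwanted` classes."""

    def __init__(self, wanted, unwanted=()):
        super().__init__()
        self.wanted = wanted
        self.unwanted = set(unwanted)
        self.depth = 0      # > 0 while we are inside a matching span
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # Nested span inside a span we are already collecting from.
            self.depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted in classes and not (classes & self.unwanted):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def scrape(html, wanted, unwanted=()):
    """Return the stripped text of every matching span in `html`."""
    parser = SpanTextCollector(wanted, unwanted)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical page fragment: two spans we want, one footer we don't.
SAMPLE = """
<div>
  <span class="stuff">Price: 10</span>
  <span class="stuff">Price: 12</span>
  <span class="stuff footer">Copyright notice</span>
</div>
"""

naive = scrape(SAMPLE, "stuff")                         # also grabs the footer span
refined = scrape(SAMPLE, "stuff", unwanted=["footer"])  # footer span excluded

print(naive)
print(refined)
```

The naive call picks up the footer span because "stuff footer" still contains the class "stuff", which is exactly the surprise described above. With a real scraping library you'd normally express the exclusion in the selector itself instead of hand-rolling a parser.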

Nice to know about Tesseract and Google Refine, though.