Scraping Websites Using PHP DOM



Thank you for your tools. I once tried scraping a website using the DOM; you need a lot of filtering on the content after scraping.
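For anyone wondering what that post-scrape filtering can look like, here is a minimal sketch using the native DOM extension. The helper name `extract_clean_text` and the sample markup are made up for illustration, and it assumes "filtering" means dropping script/style blocks and collapsing whitespace, which may not be everything a given site needs.

```php
<?php
// Hypothetical helper: drop <script>/<style> nodes, then collapse whitespace.
function extract_clean_text(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world markup is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches to an array so removing nodes during the loop is safe.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $text = $doc->documentElement ? $doc->documentElement->textContent : '';
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page.
$sample = '<html><body><h1> Title </h1><script>var x = 1;</script><p>Some   text</p></body></html>';
echo extract_clean_text($sample), "\n";
```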
 
XPath is the way to go. Regex is far too complicated (although it always works).

Just get an XPath add-on for Firefox and a script, and you are ready to scrape just about any site.
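A rough sketch of the XPath approach with PHP's built-in `DOMDocument` and `DOMXPath`. The markup, the class names, and the `links_by_xpath` helper are all invented for the example; in practice the query string would be whatever the browser add-on gives you for the target site.

```php
<?php
// Sample markup stands in for a fetched page; the class names are made up.
$html = <<<HTML
<html><body>
  <div class="post"><a href="/thread/1">First thread</a></div>
  <div class="post"><a href="/thread/2">Second thread</a></div>
  <div class="ad"><a href="/buy">Sponsored</a></div>
</body></html>
HTML;

// Hypothetical helper: run an XPath query and return [href => link text].
function links_by_xpath(string $html, string $query): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

// Only anchors inside div.post match; the "ad" block is skipped.
$links = links_by_xpath($html, '//div[@class="post"]/a');
print_r($links);
```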
 
We just use PHP Simple HTML DOM Parser for all our scrapers :) It's absolutely awesome. I can whip up a scraper for literally any site in a few minutes with it.

As opposed to the DOM parser that's already native to PHP and performs the same function, if not better?

Don't get me wrong, props for sharing another parser, but there's no sense in adding more bloat to an already bloated language and hogging more server resources without reason.
 
True, but if you're going to fuck up writing a scraper, you best fuck it up right, IMHO. The DOM parser mentioned works much the same way a jQuery selector does, and it's pretty slick. The cost is a lot of memory overhead, but hell, if you're not smart enough to write a scraper any other way, then who the fuck cares?
 
Not smart enough, or maybe time is money and you want the quickest way.

For people using the PHP DOM for parsing: how does it deal with broken HTML? Is there a good HTML cleaner in PHP, like HtmlCleaner for Java?
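For reference, a small sketch of what the native parser does with bad markup. `DOMDocument::loadHTML()` is backed by libxml2's HTML parser, which tries to recover from malformed input and reports the problems as warnings unless `libxml_use_internal_errors(true)` is set. The broken snippet below is made up, and how well recovery works on a given real page is something to check rather than assume. PHP also has an optional Tidy extension for cleanup, which is not used here because it is not always installed.

```php
<?php
// Deliberately broken markup: missing </li>, unclosed <p> and <b>, no <html>/<body>.
$broken = '<ul><li>one<li>two</ul><p>Unclosed <b>bold paragraph';

libxml_use_internal_errors(true);      // collect parse errors instead of emitting warnings
$doc = new DOMDocument();
$doc->loadHTML($broken);
$errors = libxml_get_errors();         // array of LibXMLError objects, possibly empty
libxml_clear_errors();

// Walk the recovered tree as if the markup had been valid.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

echo $doc->saveHTML();                 // the repaired tree, serialized back to HTML
printf("%d list items recovered, %d parser messages\n", count($items), count($errors));
```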
 
The parser is OK; the cURL code, though, is fucked up and very plain/textbook (slow).

Maybe he should have implemented a non-blocking, rolling version of multi cURL to speed that fucking shit up, so that while PHP is taking up 20-100% CPU on those calls he could be grabbing a lot more links and still be easy on the site he is scraping.
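Something along these lines is what a rolling window over `curl_multi` can look like. This is only a sketch, not the code from the original script: the function name `rolling_fetch`, the default window size, and the timeouts are assumptions picked for illustration. The idea is to keep a fixed number of transfers in flight and start a new one as soon as any finishes, rather than waiting for a whole batch.

```php
<?php
/**
 * Fetch $urls with at most $window transfers in flight at once.
 * Returns [url => body, or null on failure]. Hypothetical helper.
 */
function rolling_fetch(array $urls, int $window = 5): array
{
    $results = [];
    $queue   = array_values(array_unique($urls));
    if ($queue === []) {
        return $results;
    }
    $window = max(1, $window);
    $mh     = curl_multi_init();

    // Start one transfer from the front of the queue.
    $add = function () use (&$queue, $mh) {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => 10,   // assumed values, tune per site
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_PRIVATE        => $url, // remember which URL this handle belongs to
        ]);
        curl_multi_add_handle($mh, $ch);
    };

    // Prime the window.
    for ($i = 0; $i < $window && $queue; $i++) {
        $add();
    }

    do {
        curl_multi_exec($mh, $running);

        // Harvest finished transfers and refill the window right away.
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($mh, $ch);
            unset($ch);                     // handle is freed once no references remain
            if ($queue) {
                $add();
                $running = 1;               // a new transfer was queued, keep looping
            }
        }

        // Wait for socket activity instead of spinning on the CPU.
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);                   // select can fail immediately on some systems
        }
    } while ($running);

    curl_multi_close($mh);
    return $results;
}
```

Usage would be something like `$pages = rolling_fetch($urlList, 5);`. Keeping the window small is what keeps it polite to the target site; the waiting happens in `curl_multi_select()` rather than in a busy loop.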