Scraping Websites Using PHP DOM

fm1234 · May 26, 2011

Nice one from Eric Nagel:

Scraping websites using document object model

Designed with merchant datafeeds in mind, i.e., to be an active scraper without putting an undue load on the merchant's site.

Frank

bonbon · Jun 4, 2011

Thank you for the tools. I have tried scraping websites using the DOM; you need to do a lot of filtering on the content after scraping.

howdoyou · Jun 8, 2011

I've scraped websites with PHP's DOM in the past; it works very well.

jamesmart121 · Jun 16, 2011

All of this is complete Greek and Latin to me.

xhpdx · Jun 17, 2011

jamesmart121 said:
All of this is complete Greek and Latin to me.

Nah, it's not that hard; give it a few hours of testing. Oh, forgot to add: XPath > regex.

Rexibit · Jun 17, 2011

jamesmart121 said:
All of this is complete Greek and Latin to me.

It's not hard to understand. Read this tutorial I made months back on it.

PHP Tutorial 2: Advanced Data Scraping Using cURL And XPATH | Matthew Watts

xpathfucker · Jun 18, 2011

XPath is the way to go. Regex is far too complicated (although it always works).

Just get an XPath add-on for Firefox and a script, and you are ready to scrape just about any site.
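
For anyone who wants to see what the XPath approach looks like with PHP's built-in DOM classes, here is a minimal sketch. The sample markup, the `product` and `price` class names, and the `extractProducts` helper are all made up for illustration; a real page would need its own expressions.

```php
<?php
// Hypothetical sample markup standing in for a fetched product page.
$html = <<<'HTML'
<html><body>
  <div class="product"><h2>Blue Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Red Widget</h2><span class="price">$12.50</span></div>
</body></html>
HTML;

/**
 * Pull name/price pairs out of the markup with XPath instead of regex.
 * The function name and the XPath expressions are assumptions for this demo.
 */
function extractProducts(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $products = [];

    // Select every product container, then read its fields relative to it.
    foreach ($xpath->query('//div[@class="product"]') as $node) {
        $products[] = [
            'name'  => trim($xpath->evaluate('string(h2)', $node)),
            'price' => trim($xpath->evaluate('string(span[@class="price"])', $node)),
        ];
    }

    return $products;
}

foreach (extractProducts($html) as $product) {
    echo $product['name'], ': ', $product['price'], "\n";
}
```

The expressions you copy out of a browser add-on tend to be long absolute paths; short relative ones like the above usually survive layout changes better.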

goninme · Jul 25, 2011

we just use PHP Simple HTML DOM Parser for all our scrapers

it's absolutely awesome - I can whip up a scraper for literally any site in a few minutes with it.

Rexibit · Jul 25, 2011

goninme said:
we just use PHP Simple HTML DOM Parser for all our scrapers it's absolutely awesome - I can whip up a scraper for literally any site in a few minutes with it.

As opposed to the DOM parser that's already native to PHP and performs the same function, if not better?

Don't get me wrong, props for sharing another parser, but there's no sense in adding more bloat to an already bloated language and hogging more server resources without reason.

Rage9 · Jul 26, 2011

Rexibit said:
As opposed to the DOM parser that's already native to PHP and performs the same function, if not better?

Don't get me wrong, props for sharing another parser, but there's no sense in adding more bloat to an already bloated language and hogging more server resources without reason.

True, but if you're going to fuck up writing a scraper you best fuck it up right IMHO. The DOM parser mentioned works much the same way a jQuery statement works and it works pretty slick. The cost is lots of memory overhead, but hell if you're not smart enough to write a scraper in any other way then who the fuck cares?

harrymouni · Aug 2, 2011

Rage9 said:
True, but if you're going to fuck up writing a scraper you best fuck it up right IMHO. The DOM parser mentioned works much the same way a jQuery statement works and it works pretty slick. The cost is lots of memory overhead, but hell if you're not smart enough to write a scraper in any other way then who the fuck cares?

Not smart enough, or maybe time is money and you want the quickest way.

For people using the PHP DOM for parsing: how does it deal with broken HTML? Is there a good HTML cleaner in PHP, like HtmlCleaner for Java?
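
On the broken-HTML question, a rough sketch of what the native parser does: `DOMDocument::loadHTML()` sits on libxml2's HTML parser, which tries to repair malformed markup rather than reject it. The sample string and the `parseLenient` helper below are made up for illustration, and how well the repair works on a given real page is something to test, not assume.

```php
<?php
// Hypothetical malformed fragment: unclosed <p> and <b> tags.
$broken = '<div><p>First paragraph<p>Second <b>bold text</div>';

/**
 * Parse possibly broken HTML, returning the document and any parser
 * warnings libxml collected. The helper name is an assumption for this demo.
 */
function parseLenient(string $html): array
{
    // Keep libxml from emitting PHP warnings for every markup problem.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $errors];
}

[$doc, $errors] = parseLenient($broken);

// The repaired tree can be queried like any other document.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//p') as $p) {
    echo trim($p->textContent), "\n";
}

printf("%d parser message(s) collected\n", count($errors));
```

If the markup is too mangled for that, PHP's tidy extension (`tidy_repair_string()`) is the closest thing to a dedicated cleaner, provided it is installed on your server.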

eliquid · Aug 2, 2011

The parser is OK; the cURL code, though, is fucked up and very plain/textbook (slow).

Maybe he should have implemented a non-blocking, rolling version of multi cURL to speed that fucking shit up, so when he is taking up 20-100% CPU on those calls with PHP he could be grabbing a lot more links and still be easy on the site he is scraping.
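
A minimal sketch of the rolling multi cURL idea, not the tutorial author's code: keep at most `$window` transfers in flight and start the next URL as soon as one finishes. The `rollingFetch` name, the window size, and the timeout are assumptions for illustration.

```php
<?php
/**
 * Fetch URLs with a rolling window of concurrent cURL transfers.
 * Returns url => body, or null for a transfer that failed.
 * Function name and defaults are hypothetical.
 */
function rollingFetch(array $urls, int $window = 5): array
{
    $results  = [];
    $queue    = array_values($urls);
    $multi    = curl_multi_init();
    $inFlight = 0;

    // Take the next URL off the queue and attach it to the multi handle.
    $startNext = function () use (&$queue, &$inFlight, $multi): void {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 15,
            CURLOPT_PRIVATE        => $url, // remember which URL this handle is for
        ]);
        curl_multi_add_handle($multi, $ch);
        $inFlight++;
    };

    // Fill the initial window.
    while ($queue && $inFlight < $window) {
        $startNext();
    }

    do {
        $status = curl_multi_exec($multi, $running);

        // Sleep until there is socket activity instead of spinning the CPU.
        if ($running) {
            curl_multi_select($multi, 1.0);
        }

        // Harvest finished transfers and refill the window from the queue.
        while ($info = curl_multi_info_read($multi)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);

            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null;

            curl_multi_remove_handle($multi, $ch);
            $inFlight--;

            if ($queue) {
                $startNext();
            }
        }
    } while ($inFlight > 0 && $status === CURLM_OK);

    // Handles are released when they go out of scope.
    return $results;
}
```

Usage would be something like `rollingFetch($links, 3)`. The window size is what keeps this easy on the target site: a small window caps how many requests hit it at once, while still avoiding the one-at-a-time wait of plain `curl_exec()`.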

