Here is a little PHP class (not mine) that has come in very handy for me:
jonasjohn.de: htmlSQL - a PHP class to query the web by an SQL like language
Basically it has a simple SQL syntax for scraping.
It hasn't been updated in a year and a half, but it works pretty well. I was thinking about forking it and updating it. Does anybody want to add anything to the wish list?
They have already listed these to-dos:
- enhance the HTML parser
- test htmlSQL with invalid and bad HTML files
- replace the ugly eval() method for the WHERE statement with an own method
- more error checks
- include the LIMIT function/method like in SQL
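To make the eval() and LIMIT items a bit more concrete, here is a rough sketch of how a fork might handle them. None of this is htmlSQL's actual code: `where_matches()` and `select_rows()` are hypothetical names, the rows are made-up sample data, and the mini-parser only understands `$attribute == "value"` comparisons joined by `or` (no `and`, no nested parentheses, no function calls such as preg_match). It assumes PHP 7.4 or newer.

```php
<?php
// Hypothetical helper, not part of htmlSQL: checks a row against a WHERE
// clause made of `$attribute == "value"` comparisons joined by "or",
// without ever calling eval().
function where_matches(string $where, array $row): bool
{
    $where = trim($where);
    // Strip one pair of enclosing parentheses, if present.
    if (preg_match('/^\((.*)\)$/s', $where, $m)) {
        $where = $m[1];
    }
    foreach (preg_split('/\s+or\s+/i', $where) as $condition) {
        if (!preg_match('/^\s*\$(\w+)\s*==\s*"([^"]*)"\s*$/', $condition, $m)) {
            throw new InvalidArgumentException("Unsupported condition: $condition");
        }
        // $m[1] is the attribute name, $m[2] the expected value.
        if (isset($row[$m[1]]) && $row[$m[1]] === $m[2]) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: filters rows with where_matches() and applies an
// optional SQL-style LIMIT.
function select_rows(array $rows, string $where, ?int $limit = null): array
{
    $matched = array_values(array_filter($rows, fn($row) => where_matches($where, $row)));
    return $limit === null ? $matched : array_slice($matched, 0, $limit);
}

// Made-up rows standing in for parsed tags.
$rows = [
    ['tagname' => 'h1', 'text' => 'Title'],
    ['tagname' => 'p',  'text' => 'Intro'],
    ['tagname' => 'h2', 'text' => 'Sub'],
    ['tagname' => 'h3', 'text' => 'Deeper'],
];
// Single quotes keep $tagname literal instead of interpolating it.
$where = '($tagname == "h1" or $tagname == "h2" or $tagname == "h3")';
print_r(select_rows($rows, $where, 2));
```

A real replacement would need a proper tokenizer so that quoted values containing " or " are not split, but even this much avoids running user-supplied text as PHP.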
Two example queries:
SELECT * FROM * WHERE ($tagname == "h1" or $tagname == "h2" or $tagname == "h3")
SELECT href as url, text FROM a WHERE preg_match("/^http:\/\//", $href)
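For anyone wondering what those two queries boil down to, here is roughly the same selection done with PHP's built-in DOM extension instead of htmlSQL. This is only my reading of the queries, not how htmlSQL works internally, and the `$html` string is made-up sample markup standing in for a fetched page.

```php
<?php
// Made-up sample markup standing in for a downloaded page.
$html = '<html><body><h1>Title</h1><p>Intro</p><h2>Sub</h2>'
      . '<a href="http://example.com/">Absolute</a>'
      . '<a href="/about">Relative</a></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// First query: every element whose tag name is h1, h2 or h3.
$headings = [];
foreach ($xpath->query('//h1 | //h2 | //h3') as $node) {
    $headings[] = ['tagname' => $node->nodeName, 'text' => $node->textContent];
}

// Second query: href (as url) and text of links whose href starts with http://
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    if (preg_match('/^http:\/\//', $href)) {
        $links[] = ['url' => $href, 'text' => $a->textContent];
    }
}

print_r($headings);
print_r($links);
```

It is a lot more typing than the one-line SQL versions, which is pretty much the appeal of the class.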