HTMLSQL PHP scraping class

ambirex · Nov 28, 2007

Here is a little PHP class (not mine) that has come in very handy for me:

jonasjohn.de: htmlSQL - a PHP class to query the web by an SQL like language

Basically it has a simple SQL syntax for scraping.

It hasn't been updated in a year and a half, but it works pretty well. I was thinking about forking it and updating it. Anybody want to add anything to the wish list?

The author has already listed these to-dos:

enhance the HTML parser
test htmlSQL with invalid and bad HTML files
replace the ugly eval() method for the WHERE statement with a custom method
more error checks
include the LIMIT function/method like in SQL

Here are some sample queries:

SELECT * FROM * WHERE ($tagname == "h1" or $tagname == "h2" or $tagname == "h3")

SELECT href as url, text FROM a WHERE preg_match("/^http:\/\//", $href)
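For anyone who wants to see roughly what those two queries select, here is a minimal sketch using Python's standard-library html.parser. This is not htmlSQL and not how htmlSQL works internally; the class name QuerySketch, the helper run_queries, and the sample markup are made up for illustration. It also only keeps the heading text, whereas SELECT * would return every attribute.

```python
import re
from html.parser import HTMLParser


class QuerySketch(HTMLParser):
    """Collects roughly what the two sample queries would select."""

    def __init__(self):
        super().__init__()
        self.headings = []    # text of h1/h2/h3 elements (first query)
        self.links = []       # {"url", "text"} dicts for http:// links (second query)
        self._heading = None  # text chunks while inside a heading, else None
        self._link = None     # dict being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._heading = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Same test as preg_match("/^http:\/\//", $href)
            if re.match(r"^http://", href):
                self._link = {"url": href, "text": ""}

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        if self._link is not None:
            self._link["text"] += data

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3") and self._heading is not None:
            self.headings.append("".join(self._heading).strip())
            self._heading = None
        elif tag == "a" and self._link is not None:
            self.links.append(self._link)
            self._link = None


def run_queries(html):
    """Hypothetical helper: returns (headings, links) for an HTML string."""
    parser = QuerySketch()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.links


SAMPLE = (
    '<h1>Title</h1>'
    '<p><a href="http://example.com/">Example</a> <a href="/local">Local</a></p>'
    '<h4>Skipped</h4>'
)
headings, links = run_queries(SAMPLE)
```

With the sample markup, only the h1 and the absolute http:// link are kept; the relative link and the h4 are ignored.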

HairyHun · Nov 28, 2007

On the concept level, I think it'd be nice to have a user-friendly way of managing the output.

So that a non-technical person can set it up and use it in a meaningful way.

For this, maybe some scraping experts could suggest some common output formats that are useful.

Some sort of export to WordPress might be nice, with an option to space out the entries at a fixed interval or randomly (that is, if you are scraping news).

For directory scraping, perhaps be able to reorganize the categories on the fly.

This is all concept-level talk; when it gets down to the technical level it might not make sense or might be insanely difficult.

My 2 CAD cents

HH

ambirex · Nov 28, 2007

Give me some concrete examples and I'll see about knocking them out.

HairyHun · Nov 28, 2007

For blogs, here is something I would find very useful:

It would have two sides:

One side would be the scraping. It would find stuff on the search engines (and/or other sources) using keywords and/or parameters like length, keyword density, and keyword consistency (basically so that you get a good, relevant article and only the article).

The important thing is not to be limited to RSS feeds.

Then display the results in a table or similar view where I can quickly review the scraped content, delete what is shit, and reorder the rest.

Second part: Export
Be able to point the export at a WordPress blog and set the posting interval as a fixed time (every two hours) and/or semi-randomly (a random interval between 1 and 3 hours). Setting the categories and other attributes could be a nice option.

What do you think?

HH

ambirex · Nov 29, 2007

The first side:

That might be a little overly broad. Scraping works best for specific tasks. The more definition you can give the script, the better. Otherwise you end up dealing with a lot of edge cases. But I have an idea on how you could achieve something like that. I'll look into it.

Second side:

Once you have the data, it is easy to export it wherever you want.
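As a rough illustration of the timing part HairyHun described (a fixed interval, or a random gap between 1 and 3 hours), here is a minimal Python sketch that only computes the publish times. The function name publish_times and its parameters are hypothetical, and the actual push to a blog is left out since it depends on whatever export mechanism you use.

```python
import random
from datetime import datetime, timedelta


def publish_times(start, count, min_hours=1.0, max_hours=3.0, rng=None):
    """Return `count` datetimes, each min_hours..max_hours after the previous one.

    Pass the same value for min_hours and max_hours to get a fixed interval.
    """
    rng = rng or random.Random()
    times = []
    current = start
    for _ in range(count):
        # Draw the gap to the next post uniformly from the allowed range.
        current += timedelta(hours=rng.uniform(min_hours, max_hours))
        times.append(current)
    return times


start = datetime(2007, 11, 29, 8, 0)
semi_random = publish_times(start, 5)                      # gaps of 1 to 3 hours
every_two_hours = publish_times(start, 5, min_hours=2, max_hours=2)
```

Each scraped entry could then be assigned one of these times as its post date.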

SeanW · Nov 29, 2007

ambirex said:
SELECT * FROM * WHERE ($tagname == "h1" or $tagname == "h2" or $tagname == "h3")

SELECT href as url, text FROM a WHERE preg_match("/^http:\/\//", $href)

If you grok Perl, there are a few modules good for scraping, especially HTML::TreeBuilder.

The second one becomes something like

my @urls = map { $_->attr("href") } $tree->look_down("_tag" => "a", "href" => qr{^http://});

The first one would look like

my @headings = $tree->look_down( sub { $_[0]->tag =~ /^h[1-3]$/ });

As a different example, here's one that returns stuff within paragraph tags that's probably real text as opposed to footer, nav links, etc.

my @p = $tree->look_down("_tag" => "p",
    sub {
        my $text  = $_[0]->as_text;
        my @links = $_[0]->look_down("_tag" => "a");
        my @words = split /\s+/, $text;
        return length($text) > 50            # long enough to be prose
            && $text =~ /\./                 # contains at least one full stop
            && $text !~ /(copyright|http)/i  # not a footer or a bare URL
            && @links / @words < .1;         # fewer than one link per ten words
    });
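For anyone not on Perl, the same heuristic can be sketched with Python's standard-library html.parser. This is a loose translation under my own assumptions, not HTML::TreeBuilder's behavior: the names ParagraphFilter, looks_like_prose and extract_paragraphs are made up, and nested or malformed markup is handled only in the simplest way (a new p tag closes the previous one).

```python
import re
from html.parser import HTMLParser


def looks_like_prose(text, link_count):
    """Same four checks as the Perl filter: length, a full stop,
    no copyright/http, and fewer than one link per ten words."""
    words = text.split()
    return (
        len(text) > 50
        and "." in text
        and not re.search(r"copyright|http", text, re.IGNORECASE)
        and link_count / len(words) < 0.1
    )


class ParagraphFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []  # paragraphs that passed the heuristic
        self._chunks = None   # text chunks of the <p> being read, else None
        self._links = 0       # number of <a> tags seen inside that <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one begins
            self._chunks, self._links = [], 0
        elif tag == "a" and self._chunks is not None:
            self._links += 1

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def close(self):
        super().close()  # lets the parser deliver any buffered text first
        self._flush()

    def _flush(self):
        if self._chunks is None:
            return
        text = "".join(self._chunks).strip()
        self._chunks = None
        if looks_like_prose(text, self._links):
            self.paragraphs.append(text)


def extract_paragraphs(html):
    """Hypothetical helper: return the paragraphs that look like real text."""
    parser = ParagraphFilter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

The thresholds (50 characters, 0.1 links per word) are simply the ones from the Perl snippet; they would need tuning per site.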

Sean

HairyHun · Nov 29, 2007

Once you've got the data, there is a lot you can do.

What I meant is that for the most common uses (for example, WordPress export) there should be a nice UI with easy-to-use settings like the intervals I mentioned.

The overall idea would be to be able to make an autoblog (OK, semi-auto, but very low maintenance) that has amazing content and could climb in the SERPs.

HH
