HTMLSQL PHP scraping class

Status
Not open for further replies.

ambirex

Hyper Global Media
Jan 12, 2007
Here is a little PHP class (not mine) that has come in very handy for me:

jonasjohn.de: htmlSQL - a PHP class to query the web by an SQL like language

Basically it has a simple SQL syntax for scraping.

It hasn't been updated in a year and a half, but it works pretty well. I was thinking about forking it and updating it. Anybody want to add anything to the wish list?

The author has already listed these to-dos:

  • enhance the HTML parser
  • test htmlSQL with invalid and bad HTML files
  • replace the ugly eval() method for the WHERE statement with an own method
  • more error checks
  • include the LIMIT function/method like in SQL
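
On the eval() item in that list: one possible direction, sketched here in Python rather than PHP and not taken from htmlSQL itself, is to parse the WHERE clause into plain (attribute, operator, value) conditions and look the operator up in a table, so no user-supplied string ever gets evaluated. The names `OPS` and `matches` and the condition format are made up for this illustration.

```python
import operator
import re

# Hypothetical operator table: the only operations a WHERE clause may use.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "matches": lambda value, pattern: re.search(pattern, value) is not None,
}


def matches(element, conditions):
    """Return True if any (attribute, operator, value) condition holds (OR semantics)."""
    return any(
        OPS[op](element.get(attr, ""), expected)
        for attr, op, expected in conditions
    )


# Roughly the h1/h2/h3 condition from the first sample query.
headings = [("tagname", "==", "h1"), ("tagname", "==", "h2"), ("tagname", "==", "h3")]
print(matches({"tagname": "h2"}, headings))  # True
print(matches({"tagname": "p"}, headings))   # False
```

This only covers conditions joined by "or"; mixing in "and" and parentheses would need a small expression tree instead of a flat list.
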
Here are some sample queries:

SELECT * FROM * WHERE ($tagname == "h1" or $tagname == "h2" or $tagname == "h3")

SELECT href as url, text FROM a WHERE preg_match("/^http:\/\//", $href)
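
For anyone wondering what that second query amounts to, here is a rough equivalent using only Python's standard library. It is not htmlSQL and does not use its API; `LinkCollector` and `select_links` are names invented for the sketch, and it assumes reasonably well-formed, non-nested <a> tags.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects url/text rows for <a> tags whose href starts with http://."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Same test as the WHERE clause: preg_match("/^http:\/\//", $href)
            if re.match(r"^http://", self._href):
                self.rows.append(
                    {"url": self._href, "text": "".join(self._text).strip()}
                )
            self._href = None


def select_links(html):
    """Return a list of {"url": ..., "text": ...} dicts, one per matching link."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `select_links(page_source)` on a fetched page gives back one dict per absolute http:// link, which is about what the `href as url, text` columns describe.
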
 


On the concept level, I think it'd be nice to have a user-friendly way of managing the output.

So that a non-technical person can set it up and use it in a meaningful way.

For this, maybe some scraping experts could suggest some common output formats that are useful.

Some sort of export to WordPress might be nice, with an option to space out the entries at a fixed interval or randomly (that is, if you are scraping news).

For directory scraping, perhaps the ability to reorganize the categories on the fly.

This is all concept-level talk; when it gets down to the technical level it might not make sense or might be insanely difficult.

My 2 CAD cents

HH
 
Give me some concrete examples and I'll see about knocking them out.
 
For blogs, here is something I would find very useful:

It would have two sides:

One side would be the scraping. It would find content on the search engines (and/or other sources) using keywords and/or parameters like length, keyword density, and keyword consistency (basically, so that you get a good, relevant article and only the article).
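
To make the length and keyword-density part a bit more concrete, here is a minimal Python sketch. The function names and every threshold (300 words, 1% to 5% density) are assumptions picked for illustration, not recommended values, and "keyword consistency" is left out because it would need a definition first.

```python
import re


def tokenize(text):
    """Split text into lowercase words (letters, digits and apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, keyword):
    """Fraction of words in `text` that are exactly `keyword` (single word only)."""
    words = tokenize(text)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def looks_relevant(text, keyword, min_words=300, min_density=0.01, max_density=0.05):
    """Keep a page only if it is long enough and the keyword is neither rare nor stuffed."""
    if len(tokenize(text)) < min_words:
        return False
    return min_density <= keyword_density(text, keyword) <= max_density
```

A scraper could run every candidate page through `looks_relevant` and only show the survivors for manual review.
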

The important thing is not to be limited to RSS feeds.

Then display the results in a table, or in some other way that lets me quickly look over the scraped content, delete what is shit, and reorder the rest.

Second part: Export
Be able to point the export at a WordPress blog and set the posting interval either as a fixed time (every two hours) or semi-randomly (a random interval between 1 and 3 hours). Setting the categories and other attributes could be a nice option.
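
The interval part is easy to sketch. The Python below only computes the publish times; actually pushing the posts into WordPress is left out, and `publish_times` and its parameters are made-up names for the example.

```python
import random
from datetime import datetime, timedelta


def publish_times(start, count, min_hours, max_hours, rng=None):
    """Return `count` datetimes, each min_hours..max_hours after the previous one.

    Pass min_hours == max_hours for a fixed interval.
    """
    rng = rng or random.Random()
    times = []
    current = start
    for _ in range(count):
        current += timedelta(hours=rng.uniform(min_hours, max_hours))
        times.append(current)
    return times


start = datetime(2007, 1, 12, 9, 0)
every_two_hours = publish_times(start, 3, 2, 2)      # 11:00, 13:00, 15:00
semi_random = publish_times(start, 3, 1, 3)          # gaps of 1 to 3 hours each
```

Each scraped article would then be assigned the next time in the list as its publish date.
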

What do you think?

HH
 
The first side:

That might be a little overly broad. Scraping works best for specific tasks. The more definition you can give the script, the better; otherwise you end up dealing with a lot of edge cases. But I have an idea of how you could achieve something like that. I'll look into it.

The second side:

Once you have the data, it is easy to export it wherever you want.
 
SELECT * FROM * WHERE ($tagname == "h1" or $tagname == "h2" or $tagname == "h3")

SELECT href as url, text FROM a WHERE preg_match("/^http:\/\//", $href)

If you grok Perl, there are a few modules that are good for scraping, especially HTML::TreeBuilder.

The second one becomes something like

my @urls = map { $_->attr("href") } $tree->look_down("_tag" => "a", "href" => qr{^http://});

The first one would look like

my @headings = $tree->look_down( sub { $_[0]->tag =~ /^h[123]$/ } );

As a different example, here's one that returns stuff within paragraph tags that's probably real text as opposed to footer, nav links, etc.

my @p = $tree->look_down("_tag" => "p",
    sub {
        my $text  = $_[0]->as_text;
        my @links = $_[0]->look_down("_tag" => "a");
        my @words = split /\s+/, $text;
        length($text) > 50                        # long enough to be a real paragraph
            and $text =~ /\./                     # contains at least one full stop
            and $text !~ /(copyright|http)/i      # not a footer or a bare URL
            and @links / @words < .1;             # fewer than one link per ten words
    });


Sean
 
Once you've got the data, there is a lot you can do.

What I meant is that for the most common uses (for example, WordPress export) there should be a nice UI with easy-to-use settings, like the intervals I mentioned.

The overall idea would be to be able to make an autoblog (OK, semi-auto, but very low maintenance) that has amazing content and could climb in the SERPs.


HH
 