Intelligent content extractor - is there a demand?


Moxxy

I'm working on a tool atm that intelligently extracts the article portion of a web page. No specific regexes or prior knowledge of the page structure required: in most cases it will detect which part of the page is the actual article text and pull it out for you, or if there are multiple 'zones', let you pull them out one at a time. If it's unable to establish a certain confidence level it will let you know, so you don't end up with rubbish.
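
To give a rough idea of the kind of heuristic involved, here's a sketch of the general text-density vs. link-density approach using jsoup. This is NOT my actual algorithm, and the threshold numbers are made up, but it shows the shape of the idea:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BlockScorer {
    // Score candidate blocks by how much plain text they hold versus how
    // much of that text sits inside links. Menus and sidebars are link-heavy,
    // so the big contiguous run of plain text wins.
    public static Element bestBlock(String html) {
        Document doc = Jsoup.parse(html);
        Element best = null;
        double bestScore = 0;
        for (Element el : doc.select("div, td, article")) {
            double textLen = el.text().length();
            if (textLen < 140) continue;                 // too short to be an article zone
            double linkLen = 0;
            for (Element a : el.select("a")) linkLen += a.text().length();
            double score = textLen * (1.0 - linkLen / textLen);
            if (score > bestScore) { bestScore = score; best = el; }
        }
        // A weak winner is treated as "not confident" rather than returned as rubbish.
        // (A real version also has to handle nesting: outer wrappers dominate here.)
        return bestScore > 500 ? best : null;
    }
}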

So basically, with a crawler module tacked on, you could plug in some keywords and extract usable content from all over the place... Not for the whitehat at heart, obviously... But it opens up a lot of new content sources with very little work and gets you off the RSS feeds that everyone else is flogging to death.

It still needs some tuning, and I was really only writing it as a Java lib for myself, but I could turn it into a command-line app or even a GUI app if there's a demand for it.

So how about it? Would it be useful? Has it already been done? Is it worth paying for? How much? What features would you want tacked on?

*pre-emptive reply*
fuck off Extor, no you can't have the source code.
 


Like the pre-emptive reply there.

I have seen a tool like that once (couldn't get hold of it, tho) and it was great.
It did have problems with blogs, tho.

Sure would like to get my greedy little paws on it.

::emp::
 
Nice pre-emptive. I think there would be more potential in tying it into a full-featured app, as then non-programmers (most people) would have a use for it.
 
Both those tools work really well, thanks for the pointers... Mine uses a different approach, but I got some good tweaking ideas from Readability. Performance is about equivalent to Readability's. The advantage of my own version, though, is that it's client-side, which Instapaper isn't, and it doesn't require a JavaScript runtime, which Readability does. So it's effectively very easy to use as a command-line tool as part of an automation system.
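
To show what I mean about automation, usage would look something like this hypothetical wrapper (class and method names are made up, and it reuses the bestBlock sketch from my first post):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ExtractCli {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: extract <url>");
            System.exit(2);
        }
        String html = Jsoup.connect(args[0]).get().html();  // fetch the page
        Element article = BlockScorer.bestBlock(html);      // sketch from my first post
        if (article == null) System.exit(1);                // below confidence: non-zero exit
        System.out.println(article.text());                 // plain text to stdout for piping
    }
}

Then in a pipeline it's just: java ExtractCli http://example.com/some-article > article.txt, and a non-zero exit code means it wasn't confident, so the crawler can skip that URL.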

I'll try to put up a demo page sometime soon...
 
So what do you do to weed out the garbage down the road? That's probably the most important part of a tool like this...

I have an online version that grabs articles and content from a few different sources, and as I gather them the "users" of the site vote on whether what they're reading makes sense. If I've scraped something that was written by a generator, or just fucked up and missed half an article, they usually vote it off.

I've got about 40 regular users of the site who keep up with the hundreds of content items added daily. Their reward is that they get to add their own content to the network with links embedded, etc...

I could probably code it to check the length between paragraphs, or whether it includes strange words (just check verbs against their last "option" in a thesaurus, for example)...
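
Something like this rough sketch would cover the cheap checks. The common-function-word ratio is just a stand-in for the thesaurus trick, and the thresholds are guesses:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JunkCheck {
    // Real English leans heavily on function words; generated or
    // thesaurus-spun text usually doesn't, because the spinner swaps
    // common words for obscure ones.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "the", "a", "and", "of", "to", "that", "with", "this", "from", "have"));

    public static boolean looksLikeJunk(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        if (words.length < 50) return true;               // stub, or half an article missed
        int common = 0;
        for (String w : words) if (COMMON.contains(w)) common++;
        // Length between paragraphs could be checked the same way on the raw blocks.
        return (double) common / words.length < 0.08;     // guessed threshold
    }
}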

But when you are doing it on a large scale, how much junk are you ending up with? Are you planning on manual review, or just letting some crap slip through?
 
So what do you do to weed out the garbage down the road? That's probably the most important part of a tool like this...

Well, I figure if you're using a tool like this you aren't likely to be using it for a whiter-than-white purpose, so a bit of crap slipping through the cracks is probably acceptable. Recognising crap content vs. real content is a whole new kettle of fish. I'm not even sure how good Google is at that sort of thing.

I think the best way to address it is actually at the source, using some sort of tricksy method to choose how you get your source URLs. E.g. pulling only from the first page of popular search results is unlikely to return pages full of Markov shit.
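
Something as dumb as capping how deep into the results you go would do it. A sketch, assuming you've already got ranked URLs out of whatever scraper you use:

import java.util.List;

public class SourceFilter {
    // Page one of a popular query is mostly hand-written pages;
    // generated junk tends to live deeper in the rankings.
    public static List<String> firstPageOnly(List<String> rankedUrls) {
        return rankedUrls.subList(0, Math.min(10, rankedUrls.size()));
    }
}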

I like your method of incentivised manual review, though that's not really in scope here. The tool could work with or without that sort of thing. It would just work better with it.

I've put a demo page up on a GoDaddy host... Just waiting for their semi-annual Tomcat refresh to unpack the servlet, then I'll post the link.
 