I'm working on a tool atm that intelligently extracts the article portion of a web page. No site-specific regexes or prior knowledge of the page structure required: in most cases it will detect which part of the page is the actual article text and pull it out for you, or if there are multiple 'zones', let you pull them out one at a time. If it's unable to establish a certain confidence level, it will let you know so you don't end up with rubbish.
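To give a rough idea of the kind of heuristic involved (this is a minimal illustrative sketch, not my actual implementation — the class name, scoring formula, and confidence threshold here are all made up for the example): score each block of markup by its ratio of plain text to tags, pick the densest block, and bail out if the score is too low.

```java
import java.util.Optional;

// Hypothetical sketch of a density-based article extractor.
// Names and formulas are illustrative only, not the real lib's API.
public class ArticleExtractor {

    // Split the page at block-level tags, score each block by its
    // text-to-markup density, and return the best block's text --
    // or empty if no block clears the confidence threshold.
    public static Optional<String> extract(String html, double minConfidence) {
        String[] blocks = html.split("(?i)</?(div|p|td|article|section)[^>]*>");
        String best = "";
        double bestScore = 0.0;
        for (String block : blocks) {
            String text = block.replaceAll("<[^>]+>", " ").trim();
            if (text.isEmpty()) continue;
            int markup = block.length() - text.length();
            // Density of real text, weighted so longer blocks win over
            // short-but-clean navigation snippets.
            double density = (double) text.length() / (text.length() + markup + 1);
            double score = density * Math.log(1 + text.length());
            if (score > bestScore) {
                bestScore = score;
                best = text;
            }
        }
        // Crude normalisation into a rough 0..1 confidence (illustrative).
        double confidence = Math.min(1.0, bestScore / 5.0);
        return confidence >= minConfidence ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><div class=nav><a href=a>Home</a> <a href=b>About</a></div>"
            + "<div>This is the long article body with many sentences of real "
            + "prose that a density heuristic should prefer over navigation.</div>"
            + "<div class=footer><a href=c>Contact</a></div></html>";
        extract(page, 0.3).ifPresent(System.out::println);
    }
}
```

The real tool does a lot more tuning than this, but the core idea is the same: dense runs of prose stand out statistically against nav bars, ads, and footers, so no per-site rules are needed.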
So basically, with a crawler module tacked on, you could plug in some keywords and extract usable content from all over the place... Not for the whitehat at heart, obviously... But it opens up a lot of new content sources with very little work and gets you off the RSS feeds that everyone else is flogging to death.
It still needs some tuning, and I was really only writing it as a Java lib for myself, but I could turn it into a command-line app or even a GUI app if there's demand for it.
So how about it? Would it be useful? Is it already done? Is it worth paying for? How much? What features would you want tacked on?
*pre-emptive reply*
fuck off Extor, no you can't have the source code.