How to wrap unknown HTML safely

Moxxy

If you're getting html from numerous sources and you've got no control over how well or badly formed it is, is there a way to wrap it safely to include it in another page?

e.g. if the rogue html has some extra closing div tags, it will close the divs in the host page prematurely.

I've tried various html cleaners etc. but with a large variety of shitty html there's always something they can't handle. Ideally, is there some kind of absolute wrapping tag that tells the browser to treat any unclosed elements inside it as closed?
 


Can't you go through the html counting opening and closing tags, and if opening != closing, do something about it? Do you have a good reason not to be stripping out all tags?
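
Something like this would cover the counting (rough sketch, made-up class name, and the regex scan is naive - it ignores comments, scripts and attribute values with '>' in them):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCounter {
    // Matches opening tags like <div ...> and closing tags like </div>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Per-tag balance: positive = unclosed opens, negative = extra closes.
    public static Map<String, Integer> balance(String html) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group().endsWith("/>")) continue;       // self-closing, ignore
            String name = m.group(2).toLowerCase();
            int delta = m.group(1).isEmpty() ? 1 : -1;    // open vs close
            counts.put(name, counts.getOrDefault(name, 0) + delta);
        }
        return counts;
    }
}
```

Any non-zero entry in the result means that tag is unbalanced.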
 
Yes I could, but I'm being lazy. I've written so much parsing code in the last few weeks that I'm a bit jack of it, so I was hoping there was an easy solution. It's actually quite a pain in the ass to do, because you've got to take into account the syntactic rules of each tag type and ensure you don't close it in the wrong place.

Main reasons for not wanting to strip all tags are tables/lists and the formatting tags like b, i, h etc... it'd be nice to preserve some of those, and if they aren't dealt with you get some pretty rubbish output... I've written a behemoth parsing engine/content analyzer/spinner/link injector, but somewhere in the design phase I fucked up, and if the input text contains html I can't reliably add new html to the output without doing a major rewrite.

Maybe I need to just suck it up and start rewriting...
 
Java, alas... don't really want to introduce an external runtime dependency...
 
No clue with Java - I think there's JTidy, which is a port of HTML Tidy.

Python's also got Beautiful Soup. Not sure rolling your own cleanup code for html is worth it with the libraries already out there in various languages.
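
For what it's worth, the basic JTidy call is only a few lines - rough sketch, untested; note that JTidy writes out a complete document (html/head/body), so you'd still need to pull the body back out of the result:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

public class TidyPass {
    // Push a dirty fragment through JTidy and get balanced markup back.
    public static String clean(String dirtyHtml) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // force well-formed output
        tidy.setQuiet(true);          // no summary chatter
        tidy.setShowWarnings(false);  // warnings don't help here
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), out);
        return out.toString("UTF-8");
    }
}
```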
 
Already using JTidy/Jericho/HTMLCleaner, but they all crap out in certain situations... usually the crapping out means the output contains said unmatched tags (even though that's exactly what they're supposed to fix)... I'm just looking for a solution for those cases... or at the least an easy way to judge programmatically whether I have a dodgy result, so I can discard it without manual review...

may just have to bite the bullet and write some code to check for tag matches...
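
If it comes to that, a stack-based matcher doubles as the "is this result dodgy?" test - minimal sketch with made-up names; the regex tokenizer is naive (it skips void and self-closing tags, but not comments or script blocks):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatcher {
    // Void elements never take a closing tag, so they stay off the stack.
    private static final Set<String> VOID = new HashSet<>(Arrays.asList(
            "br", "hr", "img", "input", "meta", "link", "area", "base", "col"));
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    // True only if every close matches the most recent open and nothing is left open.
    public static boolean isBalanced(String html) {
        Deque<String> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (VOID.contains(name) || !m.group(3).isEmpty()) continue; // <br>, <img .../>
            if (m.group(1).isEmpty()) {
                stack.push(name);                            // opening tag
            } else if (stack.isEmpty() || !stack.pop().equals(name)) {
                return false;                                // mismatched close
            }
        }
        return stack.isEmpty();                              // leftovers = unbalanced
    }
}
```

Run it over the cleaner's output and discard anything that comes back false - no manual review needed.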
 
You could just strip out the tags except for things like <br />. You would lose your formatting but you wouldn't have garbage.
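
A quick sketch of that (hypothetical helper; it protects the br's with a placeholder so the blanket strip doesn't eat them):

```java
public class TagStripper {
    // Strip every tag except <br> / <br /> so the text keeps its line breaks.
    public static String stripAllButBr(String html) {
        String marked = html.replaceAll("(?i)<br\\s*/?>", "{{BR}}"); // protect br's
        String stripped = marked.replaceAll("<[^>]+>", "");          // drop everything else
        return stripped.replace("{{BR}}", "<br />");                 // restore br's
    }
}
```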
 
I'd go with fatmoocow on this one. There's no reason for you to be keeping every tag from each page you're scraping. Scraping is about getting content, not their shitty formatting.

Unless it's h1...n, strong, em, br, or p, you don't want it in your scrapes. I'd also remove the br's and p's and reformat them prior to adding the content to your database, so you know the formatting is clean with the proper number of br's and p's. This stops the idiot webmasters who just pad pages with a ton of br's and contentless p's.
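
Roughly, with a whitelist regex (made-up class name, and the usual caveat that regex-over-html is approximate):

```java
import java.util.regex.Pattern;

public class WhitelistCleaner {
    // Remove any tag (open or close) that isn't h1-h6, strong, em, br or p.
    private static final Pattern DISALLOWED =
            Pattern.compile("(?i)</?(?!h[1-6]\\b|strong\\b|em\\b|br\\b|p\\b)[a-z][^>]*>");

    public static String clean(String html) {
        String kept = DISALLOWED.matcher(html).replaceAll("");
        kept = kept.replaceAll("(?i)(<br\\s*/?>\\s*){2,}", "<br />"); // collapse br spam
        kept = kept.replaceAll("(?i)<p>\\s*</p>", "");                // drop contentless p's
        return kept;
    }
}
```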
 
Welcome to the really hard side of scraping. Scraping is easy when you're dealing with a known content source, like YellowPages or Google, but as soon as you start messing around with generic, untargeted scraping you get to deal with every idiot in the world's shitty code.

There are a few guys on here who do Java and understand this stuff. Maybe they'll chime in here to help you out.

You could try hitting up Stanley, he's a Java nerd :)
 
Depending on what your goal is... I say keep their shitty formatting, but make sure the counts of <div and </div> are the same, otherwise add/remove trailing tags. Maybe check the p tags too, but that could get hard.

You can put your scraped components in separate divs and force styling on them that way.
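
A sketch of that div balancing (hypothetical helper): count opens against closes, append closers for anything left open, and prepend opens so stray closers get absorbed by your wrapper instead of the host page:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivBalancer {
    public static String balanceDivs(String fragment) {
        int opens = count(fragment, "(?i)<div\\b[^>]*>");
        int closes = count(fragment, "(?i)</div\\s*>");
        StringBuilder sb = new StringBuilder(fragment);
        for (int i = closes; i < opens; i++) sb.append("</div>");   // close leftovers
        for (int i = opens; i < closes; i++) sb.insert(0, "<div>"); // absorb stray closers
        return sb.toString();
    }

    private static int count(String s, String regex) {
        Matcher m = Pattern.compile(regex).matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```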
 
So for everyone who strips out all html: how do you deal with stuff like tables, lists etc.? They turn into rubbish without the structural tags. And without manual review, how can you tell whether removing the whole lot looks like junk as well?
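
For what it's worth, one low-tech option is to convert the structure to whitespace before stripping, so tables and lists don't collapse into word soup - rough sketch, made-up name:

```java
public class TableFlattener {
    // Turn each table row into a line and each cell into a tab-delimited
    // field before the blanket strip, so the content stays readable.
    public static String flatten(String html) {
        String s = html.replaceAll("(?i)</t[dh]>", "\t");  // cell boundary -> tab
        s = s.replaceAll("(?i)</tr>", "\n");               // row boundary -> newline
        s = s.replaceAll("(?i)</li>", "\n");               // list items -> one per line
        return s.replaceAll("<[^>]+>", "").trim();         // then drop all remaining tags
    }
}
```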
 
The point of most scrapers is to get content... fuck formatting.

If you're a lazy scraper who scrapes the formatting along with the content and doesn't do anything else to it before reposting, you're doing it all wrong. Once you learn that, all you need is the content itself with some unique structuring of your own to be on the path of win.
 
I've done plenty of 'scrape and reconstruct' before. This is a different thing. The point is to remove the need to code to the source, so the human intervention at the source end can be removed while still getting the best results possible. I'm aiming for a slightly lighter shade than black, so I want to avoid as much rubbish as possible, and dumping table text into raw text is rubbish. There are plenty of options for stripping table and list content out using semantic/grammatical/structural analysis; I was just hoping there was a better way.