How to wrap unknown HTML safely

Moxxy

If you're getting html from numerous sources and you've got no control over how well or badly formed it is, is there a way to wrap it safely to include it in another page?

e.g. if the rogue html has some extra closing div tags, it will close the divs in the host page prematurely.

I've tried various html cleaners etc. but with a large variety of shitty html there's always something they can't handle. Ideally, is there some kind of absolute wrapping tag that tells the browser to treat any unclosed elements inside it as closed?
 


Can't you go through the html counting opening and closing tags, and if opening != closing, do something about it? Do you have a good reason not to be stripping out all tags?
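
Something like this would cover the counting (rough sketch, made-up class name, and the regex scan is naive - it ignores comments, scripts and attribute values with '>' in them):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCounter {
    // Matches opening tags like <div ...> and closing tags like </div>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Per-tag balance: positive = unclosed opens, negative = extra closes.
    public static Map<String, Integer> balance(String html) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group().endsWith("/>")) continue;       // self-closing, ignore
            String name = m.group(2).toLowerCase();
            int delta = m.group(1).isEmpty() ? 1 : -1;    // open vs close
            counts.put(name, counts.getOrDefault(name, 0) + delta);
        }
        return counts;
    }
}
```

Any non-zero entry in the result means that tag is unbalanced.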
 
Yes I could, but I'm being lazy. I've written so much parsing code in the last few weeks that I'm a bit jack of it, so I was hoping there was an easy solution. It's actually quite a pain in the ass to do, because you've got to take into account the syntactic rules of each tag type and ensure you don't close it in the wrong place.

Main reasons for not wanting to strip all tags are tables/lists and the formatting tags like b, i, h etc... it'd be nice to preserve some of those, and if they aren't dealt with you get some pretty rubbish output... I've written a behemoth parsing engine/content analyzer/spinner/link injector, but somewhere in the design phase I fucked up, and if the input text contains html I can't reliably add new html to the output without doing a major rewrite.

Maybe I need to just suck it up and start rewriting...
 
Java, alas... don't really want to introduce an external runtime dependency...
 
No clue with Java - I think there's JTidy, which is a port of HTML Tidy.

Python's also got Beautiful Soup. Not sure rolling your own cleanup code for html is worth it with the libraries already out there in various languages.
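
For what it's worth, the basic JTidy call is only a few lines - rough sketch, untested; note that JTidy writes out a complete document (html/head/body), so you'd still need to pull the body back out of the result:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

public class TidyPass {
    // Push a dirty fragment through JTidy and get balanced markup back.
    public static String clean(String dirtyHtml) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // force well-formed output
        tidy.setQuiet(true);          // no summary chatter
        tidy.setShowWarnings(false);  // warnings don't help here
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), out);
        return out.toString("UTF-8");
    }
}
```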
 
Already using JTidy/Jericho/HTMLCleaner, but they all crap out in certain situations... usually the crapping out means the output contains said unmatched tags (even though that's exactly what they're supposed to fix)... I'm just looking for a solution for those cases... or at the least an easy way to judge programmatically whether I have a dodgy result, so I can discard it without manual review...

may just have to bite the bullet and write some code to check for tag matches...
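
If it comes to that, a stack-based matcher doubles as the "is this result dodgy?" test - minimal sketch with made-up names; the regex tokenizer is naive (it skips void and self-closing tags, but not comments or script blocks):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatcher {
    // Void elements never take a closing tag, so they stay off the stack.
    private static final Set<String> VOID = new HashSet<>(Arrays.asList(
            "br", "hr", "img", "input", "meta", "link", "area", "base", "col"));
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    // True only if every close matches the most recent open and nothing is left open.
    public static boolean isBalanced(String html) {
        Deque<String> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (VOID.contains(name) || !m.group(3).isEmpty()) continue; // <br>, <img .../>
            if (m.group(1).isEmpty()) {
                stack.push(name);                            // opening tag
            } else if (stack.isEmpty() || !stack.pop().equals(name)) {
                return false;                                // mismatched close
            }
        }
        return stack.isEmpty();                              // leftovers = unbalanced
    }
}
```

Run it over the cleaner's output and discard anything that comes back false - no manual review needed.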
 
You could just strip out the tags except for things like <br />. You would lose your formatting but you wouldn't have garbage.
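
A quick sketch of that (hypothetical helper; it protects the br's with a placeholder so the blanket strip doesn't eat them):

```java
public class TagStripper {
    // Strip every tag except <br> / <br /> so the text keeps its line breaks.
    public static String stripAllButBr(String html) {
        String marked = html.replaceAll("(?i)<br\\s*/?>", "{{BR}}"); // protect br's
        String stripped = marked.replaceAll("<[^>]+>", "");          // drop everything else
        return stripped.replace("{{BR}}", "<br />");                 // restore br's
    }
}
```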
 
I'd go with fatmoocow on this one. There's no reason for you to be keeping every tag from each page you're scraping. Scraping is about getting content, not their shitty formatting.

Unless it's h1...n, strong, em, br, or p, you don't want it in your scrapes. I'd also remove the br's and p's and reformat them prior to adding the content to your database, so you know the formatting is clean with the proper number of br's and p's. This stops the idiot webmasters who just pad pages with a ton of br's and contentless p's.
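
Roughly, with a whitelist regex (made-up class name, and the usual caveat that regex-over-html is approximate):

```java
import java.util.regex.Pattern;

public class WhitelistCleaner {
    // Remove any tag (open or close) that isn't h1-h6, strong, em, br or p.
    private static final Pattern DISALLOWED =
            Pattern.compile("(?i)</?(?!h[1-6]\\b|strong\\b|em\\b|br\\b|p\\b)[a-z][^>]*>");

    public static String clean(String html) {
        String kept = DISALLOWED.matcher(html).replaceAll("");
        kept = kept.replaceAll("(?i)(<br\\s*/?>\\s*){2,}", "<br />"); // collapse br spam
        kept = kept.replaceAll("(?i)<p>\\s*</p>", "");                // drop contentless p's
        return kept;
    }
}
```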
 
Welcome to the really hard side of scraping. Scraping is easy when you're dealing with a known content source, like YellowPages or Google, but as soon as you start messing around with generic, untargeted scraping you get to deal with every idiot in the world's shitty code.

There are a few guys on here who do Java and understand this stuff. Maybe they'll chime in here to help you out.

You could try hitting up Stanley, he's a Java nerd :)
 
Depending on what your goal is... I say keep their shitty formatting, but make sure the counts of <div and </div> are the same, otherwise add/remove trailing tags. Maybe check the p tags too, but that could get hard.

You can put your scraped components in separate divs and force styling on them that way.
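
A sketch of that div balancing (hypothetical helper): count opens against closes, append closers for anything left open, and prepend opens so stray closers get absorbed by your wrapper instead of the host page:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivBalancer {
    public static String balanceDivs(String fragment) {
        int opens = count(fragment, "(?i)<div\\b[^>]*>");
        int closes = count(fragment, "(?i)</div\\s*>");
        StringBuilder sb = new StringBuilder(fragment);
        for (int i = closes; i < opens; i++) sb.append("</div>");   // close leftovers
        for (int i = opens; i < closes; i++) sb.insert(0, "<div>"); // absorb stray closers
        return sb.toString();
    }

    private static int count(String s, String regex) {
        Matcher m = Pattern.compile(regex).matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```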
 
So for everyone who strips out all html: how do you deal with stuff like tables, lists etc.? They turn into rubbish without the structural tags. And without manual review, how can you tell whether removing the whole lot looks like junk as well?
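
For what it's worth, one low-tech option is to convert the structure to whitespace before stripping, so tables and lists don't collapse into word soup - rough sketch, made-up name:

```java
public class TableFlattener {
    // Turn each table row into a line and each cell into a tab-delimited
    // field before the blanket strip, so the content stays readable.
    public static String flatten(String html) {
        String s = html.replaceAll("(?i)</t[dh]>", "\t");  // cell boundary -> tab
        s = s.replaceAll("(?i)</tr>", "\n");               // row boundary -> newline
        s = s.replaceAll("(?i)</li>", "\n");               // list items -> one per line
        return s.replaceAll("<[^>]+>", "").trim();         // then drop all remaining tags
    }
}
```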
 
The point of most scrapers is to get content... fuck formatting.

If you're a lazy scraper who scrapes the formatting along with the content and doesn't do anything else to it before reposting, you're doing it all wrong. Once you learn that, all you need is the content itself with some unique structuring of your own to be on the path of win.
 
I've done plenty of 'scrape and reconstruct' before. This is a different thing. The point is to remove the need to code to the source, so the human intervention at the source end can be removed while still getting the best results possible. I'm aiming for a slightly lighter shade than black, so I want to avoid as much rubbish as possible, and dumping table text into raw text is rubbish. There are plenty of options for stripping table and list content out using semantic/grammatical/structural analysis; I was just hoping there was a better way.