I'm developing a well-formatted, search-engine-style site that will have over 10 million pages. I'm optimizing it to reduce the rate of duplicated content within the site, but I need actual data to proceed.
I'm looking for a strategy/tool to detect duplicate content rate within the site. Any suggestions?
Note that only the content itself is an issue. Since the pages are basically search results, very similar pages can appear depending on the query. Meta tags, URLs, and canonical-related issues are already solved.
I could write an app that goes to every page, saves the important part to a DB, compares each page against every other page, and then produces an average similarity rate, but that would take a hell of a lot of time.
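For reference, something like the sketch below is roughly what I have in mind, assuming the important part of each page can be isolated with a CSS selector (the selector and URLs here are just placeholders). Even then, the pairwise comparison step is O(n²), which is exactly why it doesn't scale to 10 million pages:

```python
# Minimal sketch of the brute-force idea, assuming the "important part"
# of each page can be isolated with a CSS selector (selector and URL
# list below are hypothetical placeholders).
import itertools
import requests
from bs4 import BeautifulSoup

def extract_content(url, selector="div.results"):
    """Fetch a page and return the text of its main content block."""
    html = requests.get(url, timeout=10).text
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(" ", strip=True) if node else ""

def shingles(text, k=5):
    """Split text into overlapping k-word shingles for comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 = disjoint, 1.0 = identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

urls = ["https://example.com/search?q=foo", "https://example.com/search?q=bar"]
contents = {u: shingles(extract_content(u)) for u in urls}

# Pairwise comparison: O(n^2) pairs, which is what makes this approach
# impractical at 10 million pages without fingerprinting or sampling.
scores = [jaccard(contents[a], contents[b])
          for a, b in itertools.combinations(urls, 2)]
print("average similarity:", sum(scores) / len(scores) if scores else 0.0)
```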
Thanks in advance