A strategy/tool to detect duplicate content rate on a site

Truth.

I'm developing a well-formatted, search-engine-style site that will have over 10 million pages. I'm optimizing it to reduce the rate of duplicate content within the site, but I need actual data to proceed.

I'm looking for a strategy/tool to detect the duplicate content rate within the site. Any suggestions?

Note that only the content itself is an issue. Since the pages are basically search results, some pages end up very similar depending on the query. Meta tags, URLs and canonical-related issues are already solved.

I could write an app that goes to every page, saves the important part to a DB, compares each page with every other page, and then computes an average similarity rate, but that would take a hell of a lot of time.
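For the record, here is roughly what I mean, as a Python sketch (assuming I sample a couple of thousand pages instead of all 10M, extract the main content into page_texts myself beforehand, and use word-level Jaccard similarity as the measure; the names and numbers are just placeholders I picked):

```python
import itertools
import random

def jaccard(a_words, b_words):
    """Jaccard similarity between two word sets (0.0 - 1.0)."""
    if not a_words and not b_words:
        return 1.0
    return len(a_words & b_words) / len(a_words | b_words)

def estimate_duplicate_rate(page_texts, sample_size=2000, threshold=0.8):
    """Average pairwise similarity and share of near-duplicate pairs,
    computed on a random sample of pages.

    page_texts: dict mapping URL -> extracted main content (str).
    """
    urls = random.sample(list(page_texts), min(sample_size, len(page_texts)))
    word_sets = {u: set(page_texts[u].lower().split()) for u in urls}

    total, dup_pairs, n_pairs = 0.0, 0, 0
    for u, v in itertools.combinations(urls, 2):
        sim = jaccard(word_sets[u], word_sets[v])
        total += sim
        n_pairs += 1
        if sim >= threshold:
            dup_pairs += 1

    if n_pairs == 0:
        return 0.0, 0.0
    return total / n_pairs, dup_pairs / n_pairs
```

Even on a sample this is O(n^2) pairs, which is why I'm hoping there is a ready-made tool or a smarter strategy.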

Thanks in advance


Obviously, a rough estimate is enough for me. I would compare at the word level using some public algorithms.
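To be concrete about the "public algorithms" part: I'm thinking of something like w-shingling plus MinHash signatures, so each page collapses into a small fixed-size signature and two pages can be compared cheaply without re-reading their text. A rough sketch of that idea (NUM_HASHES, the shingle size k, and the md5-based hash functions are just values I picked for illustration):

```python
import hashlib
import re

NUM_HASHES = 64  # more hash functions -> better estimate, but slower

def shingles(text, k=5):
    """k-word shingles of the page's main content."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(shingle_set):
    """MinHash signature: for each of NUM_HASHES salted hash functions,
    keep the minimum hash value seen over all shingles."""
    sig = []
    for i in range(NUM_HASHES):
        salt = str(i).encode()
        sig.append(min(
            int(hashlib.md5(salt + s.encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions approximates the
    Jaccard similarity of the two pages' shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES
```

The fraction of matching positions in two signatures approximates the Jaccard similarity of the shingle sets, so the same averaging as above could run over signatures instead of full texts.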

Google does a similar thing for the entire web, and I want to do it within my own site so I can make the necessary changes.