I'm developing a well-formatted, search-engine-style site that will have over 10 million pages. I'm optimizing it to reduce the rate of duplicated content within the site, but I need actual data to proceed.
I'm looking for a strategy/tool to detect duplicate content rate within the site. Any suggestions?
Note that only the content itself is an issue. Since the pages are basically search results, very similar pages can appear depending on the query. Meta tags, URLs, and canonical-related issues are already solved.
I could write an app that goes to every page, saves the important part to a DB, compares each page against every other page, and then produces an average similarity rate, but that would take a hell of a lot of time.
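For reference, something like the sketch below is roughly what I have in mind, assuming the important part of each page can be isolated with a CSS selector (the selector and URLs here are just placeholders). Even then, the pairwise comparison step is O(n²), which is exactly why it doesn't scale to 10 million pages:

```python
# Minimal sketch of the brute-force idea, assuming the "important part"
# of each page can be isolated with a CSS selector (selector and URL
# list below are hypothetical placeholders).
import itertools
import requests
from bs4 import BeautifulSoup

def extract_content(url, selector="div.results"):
    """Fetch a page and return the text of its main content block."""
    html = requests.get(url, timeout=10).text
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(" ", strip=True) if node else ""

def shingles(text, k=5):
    """Split text into overlapping k-word shingles for comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 = disjoint, 1.0 = identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

urls = ["https://example.com/search?q=foo", "https://example.com/search?q=bar"]
contents = {u: shingles(extract_content(u)) for u in urls}

# Pairwise comparison: O(n^2) pairs, which is what makes this approach
# impractical at 10 million pages without fingerprinting or sampling.
scores = [jaccard(contents[a], contents[b])
          for a, b in itertools.combinations(urls, 2)]
print("average similarity:", sum(scores) / len(scores) if scores else 0.0)
```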
Thanks in advance