Classification of the Approaches of the Near Duplicate Document Detection and Elimination

Kavita Goyal, Saba Hilal, Jayshankar Prasad

Abstract


Identifying and removing near-duplicate documents is an important research area because near-duplicate pages add overhead on the web. They increase storage space and indexing cost, cause crawlers to produce the same type of results, and make the search process ineffective. Identical pages exist on the web because it contains a huge volume of records. Much research has been done in this area, but the problem persists. Researchers have studied the problem from different perspectives and tried to formulate solutions; however, the problem is intensifying as new pages are added to the web. This paper studies previous research and classifies the existing algorithms and approaches, with the aim of structuring the area of duplicate document detection.





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.

ISSN: 2251-1563