Unit:
Department of Informatics and Telecommunications
Supervisors info:
Δημήτρης Γουνόπουλος, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
Αναγνωρίζοντας σχεδόν-Διπλότυπα Αρχεία
Translated title:
Identifying near-Duplicate Documents
Summary:
The aim of this thesis was to identify near-duplicate documents, that is, documents that closely resemble each other. The mathematical concept of document resemblance captures the informal notion of syntactic similarity well. For each document, a fixed-size sketch is computed: a small subset of the original document, a few hundred bytes in size. The sketches of two documents are then used to estimate their resemblance, which is expressed as a number between 0 and 1; the closer the value is to 1, the more the two documents resemble each other. A series of tests indicated that the algorithm works as intended.
The algorithm for comparing two documents and estimating their resemblance is useful in several settings, e.g. for directly comparing two documents, and for filtering duplicate results in search engines so that users see a greater variety of results and obtain more useful information. An algorithm similar to the one implemented in this thesis was developed and used by the AltaVista search engine.
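The sketch-based estimate described in this summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the shingle width `w`, the sketch length `num_perm`, the function names, and the use of seeded BLAKE2b hashes as a stand-in for random permutations are all assumptions made for illustration. Resemblance is taken here to be the Jaccard coefficient of the two documents' shingle sets, |A ∩ B| / |A ∪ B|, and the sketch keeps one minimum hash value per simulated permutation.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word shingles of a text (w is an assumed default)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; each seed simulates one random permutation (assumption)."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, num_perm=100, w=4):
    """Fixed-size sketch: the minimum hash value under each simulated permutation."""
    sh = shingles(text, w)
    if not sh:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree, a value in [0, 1]."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_resemblance(text_a, text_b, w=4):
    """Exact Jaccard resemblance of the shingle sets, for comparison with the estimate."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With these assumed defaults, each sketch holds 100 eight-byte values regardless of document length, so two documents can be compared without access to their full text. Calling `estimate_resemblance(sketch(doc_a), sketch(doc_b))` gives the estimate, and `exact_resemblance(doc_a, doc_b)` can be used on small inputs to check how close the estimate is.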
Main subject category:
Technology - Computer science
Keywords:
resemblance, comparison, shingling, permutations, sketch