Identifying near-Duplicate Documents

Graduate Thesis uoadl:2926479

Unit:
Department of Informatics and Telecommunications
Informatics
Deposit date:
2020-10-28
Year:
2020
Author:
REPPAS IOANNIS
Supervisors info:
Δημήτρης Γουνόπουλος, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
Αναγνωρίζοντας σχεδόν-Διπλότυπα Αρχεία
Languages:
Greek
Translated title:
Identifying near-Duplicate Documents
Summary:
The aim of this thesis was to identify near-duplicate documents, that is, documents that closely resemble each other. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. Specifically, a fixed-size sketch is calculated for each document. A sketch is a small subset of the starting document, a few hundred bytes in size. The sketches of two documents are used to estimate their resemblance, which is expressed as a number ranging from 0 to 1: the closer the value is to 1, the more the two documents resemble each other. After a number of tests, it was concluded that the algorithm functions properly.
The algorithm for comparing two documents and estimating their resemblance is useful in various situations, for example for directly comparing two documents, or for filtering duplicate results in search engines so that users see a wider variety of results. An algorithm similar to the one implemented in this thesis was developed and used by the AltaVista search engine.
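The idea summarized above (shingling, a fixed-size sketch per document, and a resemblance estimate between 0 and 1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes word-level shingles of width w = 4, defines resemblance as the Jaccard coefficient of the two shingle sets, and uses min-wise hashing with k = 64 seeded 64-bit hashes (about 512 bytes per sketch) in place of true random permutations. The function names and parameters are hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Assumption: a document shorter than w words is one single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def _hash(shingle, seed):
    """Seeded 64-bit hash; assumed stand-in for one random permutation."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, w=4, k=64):
    """Fixed-size sketch: the minimum hash value under each of k seeds."""
    sh = shingles(text, w)
    if not sh:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(s, seed) for s in sh) for seed in range(k)]


def estimate_resemblance(sk_a, sk_b):
    """Fraction of sketch positions that agree, a value in [0, 1]."""
    if len(sk_a) != len(sk_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sk_a, sk_b) if x == y)
    return matches / len(sk_a)


if __name__ == "__main__":
    doc_a = "a rose is a rose is a rose"
    doc_b = "a rose is a flower which is a rose"
    print("exact:    ", resemblance(doc_a, doc_b))
    print("estimated:", estimate_resemblance(sketch(doc_a), sketch(doc_b)))
```

The exact value requires both full shingle sets, whereas the estimate needs only the two fixed-size sketches, which is what makes comparison cheap at search-engine scale. The estimate fluctuates around the exact value, and a larger k reduces that fluctuation at the cost of a larger sketch.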
Main subject category:
Technology - Computer science
Keywords:
resemblance, comparison, shingling, permutations, sketch
Index:
Yes
Number of index pages:
3
Contains images:
Yes
Number of references:
5
Number of pages:
42
IdentifyingNearDuplicateDocuments.pdf (508 KB)