Pergamos - Library and Information Center of National and Kapodistrian University of Athens

Unit:

Department of Informatics and Telecommunications
Informatics

Deposit date:

2020-10-28

Year:

2020

Author:

REPPAS IOANNIS

Supervisors info:

Δημήτρης Γουνόπουλος, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

Original Title:

Αναγνωρίζοντας σχεδόν-Διπλότυπα Αρχεία

Languages:

Greek

Translated title:

Identifying near-Duplicate Documents

Summary:

The aim of this thesis was to identify near-duplicate documents, that is, documents that closely resemble each other. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. Specifically, a fixed-size sketch is computed for each document. A sketch is a small subset of the original document, a few hundred bytes in size. The sketches of two documents are used to estimate their resemblance, which is expressed as a number between 0 and 1: the closer the value is to 1, the more the two documents resemble each other. After a number of tests, it was concluded that the algorithm functions properly.
The algorithm for comparing two documents and estimating their resemblance is useful in various situations, for example for directly comparing two documents, or for filtering duplicate results in search engines so that users see a wider variety of results and therefore more useful information. An algorithm similar to the one implemented in this thesis was developed and used by the AltaVista search engine.
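
The idea described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes word-level shingles of width 3, and it approximates random permutations with seeded affine hash functions, so the sketch here stores minimum hash values rather than a literal subset of the document. All names (shingles, resemblance, make_permutations, sketch, estimate_resemblance) and the parameter values are hypothetical, chosen only for illustration.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the illustrative hash "permutations"


def shingles(text, w=3):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Exact resemblance: size of the shingle intersection over the union."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def shingle_id(shingle):
    """Map a shingle to a stable integer below PRIME (assumed encoding)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(k=64, seed=0):
    """Draw k affine maps x -> (a*x + b) mod PRIME as stand-in permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def sketch(text, perms, w=3):
    """Fixed-size sketch: the minimum permuted shingle id per permutation."""
    ids = [shingle_id(s) for s in shingles(text, w)]
    if not ids:
        raise ValueError("cannot sketch an empty document")
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def estimate_resemblance(sk_a, sk_b):
    """Estimate resemblance as the fraction of sketch positions that agree."""
    matches = sum(1 for x, y in zip(sk_a, sk_b) if x == y)
    return matches / len(sk_a)


if __name__ == "__main__":
    perms = make_permutations()
    doc_a = "the quick brown fox jumps"
    doc_b = "the quick brown fox sleeps"
    print("exact:", resemblance(doc_a, doc_b))
    print("estimated:", estimate_resemblance(sketch(doc_a, perms),
                                             sketch(doc_b, perms)))
```

With 64 values of at most 8 bytes each, such a sketch occupies about 512 bytes regardless of document length, which is in line with the "few hundred bytes" mentioned above. For the two example sentences the exact resemblance is 0.5 (two shared shingles out of four distinct ones); the estimate from the sketches approaches that value as the number of permutations grows.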

Main subject category:

Technology - Computer science

Keywords:

resemblance, comparison, shingling, permutations, sketch

Index:

Yes

Number of index pages:

Contains images:

Yes

Number of references:

Number of pages:

File: