AutoER: Auto-Configuring Entity Resolution pipelines

Postgraduate Thesis uoadl:3417680

Unit:
Specialization in Data, Information and Knowledge Management
Informatics
Deposit date:
2024-09-26
Year:
2024
Author:
Nikoletos Konstantinos
Supervisors info:
Vasilis Efthymiou, Assistant Professor, Department of Informatics and Telematics, Harokopio University
George Papadakis, Researcher, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA)
Konstantinos Stefanidis, Professor, Department of Information and Communication Technologies, Tampere University, Finland
Manolis Koubarakis, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA)
Original Title:
AutoER: Auto-Configuring Entity Resolution pipelines
Languages:
English
Translated title:
AutoER: Auto-Configuring Entity Resolution pipelines
Summary:
The same real-world entity (e.g., a movie, a restaurant, a person) may be described in different ways across datasets. Entity resolution (ER) is the problem of finding the descriptions that refer to the same entity, thereby improving data quality and, in turn, data value. However, an ER pipeline typically involves several steps (e.g., blocking, similarity estimation, clustering), each requiring its own configuration and tuning. The best configuration, among a vast number of possible combinations, is dataset-specific, as has been shown experimentally, and finding it often requires some pre-labeled examples, i.e., a ground truth. In essence, finding the best configuration for resolving the entities of a dataset is a labor-intensive task, not only for non-expert users who want their data cleaned, but also for ER experts. In this work, we introduce AutoER, a framework that automatically suggests the most promising ER configuration, even when no ground truth is available. When some ground truth is available, AutoER relies on sampling strategies that can significantly reduce the search space of configuration values. When no pre-labeled examples of a given dataset are available, AutoER trains a regression model on a pre-defined set of ER-specific dataset features, along with configuration features, using other datasets that do have a ground truth. We show experimentally, over eleven ER benchmark datasets, that AutoER consistently and efficiently suggests near-optimal ER configurations, by comparing it to an exhaustive grid search.
Main subject category:
Technology - Computer science
Keywords:
Entity Resolution, Auto Configuration, Artificial Intelligence
Index:
No
Number of index pages:
0
Contains images:
Yes
Number of references:
70
Number of pages:
41
File:
File access is restricted until 2025-03-26.

AutoER_MSC_Thesis.pdf
970 KB