PyJedAI Parallelization with MPIRE

Graduate Thesis uoadl:3395772

Unit:
Department of Informatics and Telecommunications
Informatics
Deposit date:
2024-04-08
Year:
2024
Author:
KONTONIS ΗΛΙΑΣ-ΕΛΕΥΘΕΡΙΟΣ
Supervisors info:
Manolis Koubarakis, Professor and Director of Graduate Studies in the Department of Informatics and Telecommunications, National and Kapodistrian University
Original Title:
PyJedAI Parallelization with MPIRE
Languages:
English
Translated title:
PyJedAI Parallelization with MPIRE
Summary:
Entity resolution is a critical task in various applications, but its cost grows
quadratically with the number of entities, since every entity must be compared with
every other. To make entity resolution scalable to large datasets, blocking is typically
employed. Syntactic blocking methods usually group similar entities into overlapping
blocks, reducing the number of necessary comparisons. Further efficiency gains are
achieved with Meta-blocking, which prunes unnecessary comparisons in overlapping
blocks, significantly improving precision without sacrificing much recall.
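
As an illustration of the blocking and Meta-blocking steps described above, the following is a minimal sketch and not PyJedAI's actual API. It assumes token blocking as the syntactic blocking method and a simple pruning rule (keep pairs whose number of shared blocks is at least the mean); the function names and the sample entities are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Map each token to the set of entity ids whose text contains it."""
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in set(text.lower().split()):
            blocks[token].add(entity_id)
    # A block holding a single entity yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Weight each candidate pair by its number of shared blocks and
    keep only the pairs whose weight is at least the mean weight."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical sample data: entities 1-2 and 3-4 describe the same product.
entities = {
    1: "Apple iPhone 12 black",
    2: "iPhone 12 Apple smartphone",
    3: "Samsung Galaxy S21 black",
    4: "Galaxy S21 by Samsung",
}
blocks = token_blocking(entities)
candidates = meta_blocking(blocks)
print(candidates)
```

In this toy input, brute force would compare all 6 pairs, blocking leaves 3 distinct candidate pairs, and the pruning step discards the pair that shares only the token "black".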
However, despite its time efficiency, applying Meta-blocking to solve entity resolution
problems on very large datasets remains a challenge. For instance, processing 7.4 million
entities can take almost eight full days on a high-end server.
In this thesis, we parallelize the Python framework PyJedAI. Python introduces new
challenges due to the Global Interpreter Lock (GIL), which forces us to implement a
fork-join model instead of spawning multiple threads. We use the MPIRE Python module
to implement the parallel Meta-blocking algorithms.
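
The fork-join pattern mentioned above can be sketched as follows. This is a hedged illustration, not the thesis implementation: it swaps in the standard-library `multiprocessing` module in place of MPIRE, assumes a POSIX system where the `fork` start method is available, and uses hypothetical function names and data. The per-block work (counting co-occurring entity pairs) is a stand-in for a Meta-blocking step.

```python
import multiprocessing as mp
from collections import Counter
from itertools import combinations


def pair_counts(block_chunk):
    """Worker task: count co-occurring entity pairs in one chunk of blocks."""
    counts = Counter()
    for ids in block_chunk:
        counts.update(combinations(sorted(ids), 2))
    return counts


def parallel_pair_counts(blocks, n_workers=2):
    """Fork worker processes, hand each a chunk of blocks, then join
    (merge) the partial results in the parent process."""
    # Round-robin split so every worker receives a similar number of blocks.
    chunks = [blocks[i::n_workers] for i in range(n_workers)]
    # Assumption: the "fork" start method exists (Linux and other POSIX systems).
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_workers) as pool:
        partials = pool.map(pair_counts, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    sample_blocks = [{1, 2}, {1, 2, 3}, {3, 4}, {2, 3}]
    print(parallel_pair_counts(sample_blocks))
```

Because each worker is a separate process with its own interpreter, the GIL does not serialize the work, but every chunk and every partial result must be copied between processes, which is one source of overhead in this model.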
The experimental analysis validates the scalability of the parallel implementation as well
as the significant time reduction in certain steps of Meta-blocking. We also analyze the
deadlocks that the fork-join model caused, their effect on the time efficiency of our
implementation, and how they can be overcome.
Main subject category:
Science
Keywords:
Entity Resolution, Meta-blocking, parallelization, fork-join, GIL, MPIRE, PyJedAI
Index:
Yes
Number of index pages:
2
Contains images:
Yes
Number of references:
33
Number of pages:
43
BSc_Thesis_on_pyJedAI_Parallelization_with_MPIRE.pdf (1 MB)