Parallel meta-blocking for scaling entity resolution over big heterogeneous data

Scientific publication - Journal article uoadl:3070956 16 views

Unit:
NKUA research material
Title:
Parallel meta-blocking for scaling entity resolution over big heterogeneous data
Document languages:
English
Abstract:
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. In order to enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is used to clean the overlapping blocks of unnecessary comparisons, increasing precision by orders of magnitude at a small cost in recall. Despite its high time efficiency, though, using Meta-blocking in practice to solve the entity resolution problem on very large datasets is still challenging: applying it to 7.4 million entities takes (almost) 8 full days on a modern high-end server. In this paper, we introduce scalable algorithms for Meta-blocking, exploiting the MapReduce framework. Specifically, we describe a strategy for parallel execution that explicitly targets the core concept of Meta-blocking, the blocking graph. Furthermore, we propose two more advanced strategies, aiming to reduce the overhead of data exchange. The comparison-based strategy creates the blocking graph implicitly, while the entity-based strategy is independent of the blocking graph, employing fewer MapReduce jobs with more elaborate processing. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the feasibility and superiority of our advanced strategies, and demonstrates their scalability to very large datasets. © 2017 Elsevier Ltd
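The blocking graph mentioned in the abstract can be illustrated with a small in-memory sketch. This is not the paper's implementation: it assumes a simple weighting scheme (an edge's weight is the number of blocks two entities share) and a simple pruning rule (drop edges below the mean weight), and it only simulates the map, shuffle, and reduce steps of a single MapReduce job in one process. The function names and the toy blocks are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_block(block_id, entity_ids):
    """Map phase: emit one ((a, b), 1) record per comparison in a block."""
    for a, b in combinations(sorted(entity_ids), 2):
        yield (a, b), 1


def reduce_pair(pair, counts):
    """Reduce phase: the edge weight is the number of shared blocks (assumed scheme)."""
    return pair, sum(counts)


def blocking_graph(blocks):
    """Simulate one MapReduce job in memory: map, shuffle by pair, reduce."""
    shuffled = defaultdict(list)
    for block_id, entity_ids in blocks.items():
        for pair, count in map_block(block_id, entity_ids):
            shuffled[pair].append(count)
    return dict(reduce_pair(pair, counts) for pair, counts in shuffled.items())


def prune_below_mean(edges):
    """Keep only edges whose weight is at least the mean edge weight (assumed rule)."""
    if not edges:
        return {}
    mean_weight = sum(edges.values()) / len(edges)
    return {pair: w for pair, w in edges.items() if w >= mean_weight}


# Hypothetical toy input: three overlapping blocks over four entities.
blocks = {
    "b1": ["e1", "e2", "e3"],
    "b2": ["e1", "e2"],
    "b3": ["e2", "e3", "e4"],
}
edges = blocking_graph(blocks)   # 5 distinct candidate comparisons
kept = prune_below_mean(edges)   # 2 comparisons survive pruning
```

In this toy input the blocks contain 7 comparisons in total, 5 of them distinct; pruning at the mean weight of 1.4 retains only the two pairs that co-occur in two blocks. Emitting one record per comparison is also what makes the data exchange between map and reduce expensive, which is the overhead the abstract says the more advanced strategies aim to reduce.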
Publication year:
2017
Authors:
Efthymiou, V.
Papadakis, G.
Papastefanatos, G.
Stefanidis, K.
Palpanas, T.
Journal:
Information Systems
Publisher:
Elsevier Ireland Ltd
Volume:
65
Pages:
137-157
Keywords:
Electronic data interchange; Experimental analysis; High time efficiency; Load balancing algorithms; Map/reduce; MapReduce frameworks; Meta-blocking; Parallelizations; Quadratic complexity; Efficiency
Official URL (Publisher):
DOI:
10.1016/j.is.2016.12.001
The digital material for this item is not available.