Breaking String Analytics Bottlenecks in Data Science Pipelines

Doctoral Dissertation uoadl:3371018 43 Read counter

Unit:
Κατεύθυνση / ειδίκευση Διαχείριση Πληροφορίας και Δεδομένων (ΔΕΔ)
Πληροφορική
Deposit date:
2023-12-08
Year:
2023
Author:
Foufoulas Yannis
Dissertation committee:
Ιωάννης Ιωαννίδης, Καθηγητής, Πληροφορικής και Τηλεπικοινωνιών, ΕΚΠΑ
Μιχάηλ Χατζόπουλος, Ομότιμος Καθηγητής, Πληροφορικής και Τηλεπικοινωνιών, ΕΚΠΑ
Αλέξης Δελης, Καθηγητής, Πληροφορικής και Τηλεπικοινωνιών, ΕΚΠΑ
Μέμα Ρουσσοπούλου, Καθηγήτρια, Πληροφορικής και Τηλεπικοινωνιών, ΕΚΠΑ
Ιωάννης Παπακωνσταντίνου, Καθηγητής, Πανεπιστήμιο Καλιφόρνια - Σαν Ντιέγκο
Ευστράτιος Υδραίος, Αναπληρωτής Καθηγητής, Πανεπιστήμιο Χάρβαρντ
"Άλκης Σιμιτσής, Διευθυντής Ερευνών, Ερευνητικό κέντρο ΑΘΗΝΑ"
Original Title:
Αντιμετώπιση σημείων συμφόρησης στην ανάλυση συμβολοσειρών σε εφαρμογές επιστήμης δεδομένων
Languages:
English
Translated title:
Breaking String Analytics Bottlenecks in Data Science Pipelines
Summary:
The diversity and complexity of modern data management applications along with the
increase of available data open up opportunities for improvements in all the execution layers
of a common data science workflow. In this thesis, we target string analytics and we
optimise a common analytical workflow tackling several major bottlenecks. We examine
the citation matching problem as our usecase.
This is a significant problem as it allows
researchers to search literature, find relative publications, conduct analysis of research
trends and so on. This is also a complex problem, given the diversity of data sources,
the exponential increase of open access literature and the heavy string processing involved.
We break down the problem into three different layers: 1) algorithm layer 2)string
storage/processing layer 3) implementation layer
Each of these layers introduce its own bottlenecks. Current literature proposes algorithms
that first mine texts and extract structured citations using machine learning and heuristics.
These citations are joined against the target metadata to produce a citation graph. The
first step is the heavy one.Given the various references section templates that exist, it is
not possible to produce an accurate rule based solution for citation extraction.
The storage and string processing layer involves its own overheads. String processing is
heavy, and also data volumes are big. An efficient layout for string storage could optimise
both data storage and transfer and execution of string operations like scans and joins.
The implementation layer contains both procedural operations (e.g., text preprocessing,
normalizations and so on) which are usually implemented in high level languages like
Python, and string processing operations (e.g., scans, joins) that could be more efficiently
executed in the context of a data management system. So the problem is, how to merge
these two types of operations in an effective way.
The contributions of the thesis can be summarized as follows:
• We design a citation matching algorithm which skips the heavy citation extraction
step and merges the text processing and citation matching steps into one. According
to our experiments, the proposed technique achieves higher recall rates (i.e.,
produces up to 20% more citation links), and is up to more than two orders of magnitude
faster than alternative techniques which first extract structured citations from
plain texts.
• We introduce a highly compressible storage format for string attributes based on
adaptive dictionary encoding that allows fast scans, filters, sorts, joins and random
lookups. Our experiments have given very promising results, showing that, in many
cases, the proposed new dictionary compression scheme is much more efficient
than existing techniques in terms of compression rates and read operations, occasionteally
up to an order of magnitude.
• We present an SQL extension that allows the easy and efficient integration of procedural
operations written in Python within SQL. The proposed SQL extension supports
the creation of expressive user defined functions (UDFs) which are seamlessly
integrated with an embedded or server based DBMS and utilizes techniques like inprocess
UDF execution, tracing JIT compilation, vectorization, and UDF fusion to
enhance performance. Our experimental analysis showcases the usability and expressiveness
of the proposed SQL extension and demonstrates that our techniques
are very effective and achieve significant speedups up to 68x in common, practical
use cases compared to earlier approaches and alternative implementation choices.
• We implement a UDF library for text analysis on top of the SQL extension. Its purpose
is to enable indatabase
text analysis reaping the benefits of DBMSes and the
extended SQL language. The library contains more than 150 functions that can be
used as building blocks to implement algorithms for text analytics.
• We conduct a variety of endtoend
experiments and microbenchmarks that prove
the effectiveness of the proposed solutions. We evaluate all the proposed solutions
in terms of performance and additionally, we evaluate the citation matching algorithm
in terms of accuracy, adaptive dictionary encoding in terms of compression rates,
and we examine the usability of YeSQL and its library. The experiments show that
the proposed techniques outperform the state of the art alternatives.
Finally, we propose several optimisation opportunities which emerge and are part of our
current and future work.
Main subject category:
Technology - Computer science
Keywords:
Citation matching, Compression, Extended SQL
Index:
Yes
Number of index pages:
4
Contains images:
Yes
Number of references:
114
Number of pages:
142
dissertation.pdf (2 MB) Open in new window