Comp-BERT-ition: Which BERT model is better for Greek legal text classification?

Graduate Thesis uoadl:2960898

Unit:
Department of Informatics and Telecommunications
Deposit date:
2021-09-16
Year:
2021
Author:
VAMVOURELLIS EFSTRATIOS
Supervisors info:
Manolis Koubarakis, Professor, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications
Despina-Athanasia Pantazi, PhD Candidate, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications
Original Title:
Comp-BERT-ition: Which BERT model is better for Greek legal text classification?
Languages:
English
Translated title:
Comp-BERT-ition: Which BERT model is better for Greek legal text classification?
Summary:
Deep Neural Networks (DNNs) form a very active subfield of Artificial Intelligence (AI), and many experts believe they may shape the future of Computer Science. Natural Language Processing (NLP) is the area of AI and linguistics concerned with the interactions between computers and human language, in particular how to program computers to process and analyze natural language data. With the creation of BERT [5], a large DNN tasked with understanding the English language, in 2018, and its integration into the Google search algorithm, the field of NLP took a big leap forward. Since then, only a few models have managed to surpass BERT, and only by small margins; until recently (2020), BERT and its variants were considered state of the art. The current thesis examines different variations of the BERT model, trained on different datasets, and their ability to classify Greek legal documents. It also discusses ways to further improve our fine-tuned models for legal-domain tasks, such as domain-specific adaptation and vocabulary expansion. We use the RAPTARCHIS [3] dataset, which provides Greek legal documents for three classification tasks. Our fine-tuned Greek-only models have very similar performance, while the multilingual model falls behind. We conclude that with domain- and task-adaptive pre-training, performance would surely improve across all models. We also hypothesize that, based on known heuristics described in chapter 5, the multilingual model could surpass the others. The metrics we use are precision (P), recall (R) and F1 score; we chose these metrics to allow a direct comparison with the prior models evaluated on the same dataset.
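To make the fine-tuning setup concrete, the sketch below shows how one of the compared checkpoints could be fine-tuned for such a classification task with the Hugging Face Transformers library. This is an illustrative sketch, not the thesis code: the model names are real Hub checkpoints for Greek and multilingual BERT, but the dataset identifier greek_legal_code, its "volume" configuration, and the training hyperparameters are assumptions standing in for one of the three RAPTARCHIS tasks.

    # Minimal fine-tuning sketch (assumed setup, not the thesis code).
    from transformers import (AutoTokenizer,
                              AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    MODEL = "nlpaueb/bert-base-greek-uncased-v1"   # one of the Greek-only models
    # MODEL = "bert-base-multilingual-uncased"     # the multilingual baseline

    # Assumed Hub id/config for one RAPTARCHIS classification task.
    dataset = load_dataset("greek_legal_code", "volume")
    num_labels = dataset["train"].features["label"].num_classes

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL, num_labels=num_labels)

    def tokenize(batch):
        # Legal documents are long; BERT truncates at 512 subword tokens.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    encoded = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],  # assumes a validation split exists
        tokenizer=tokenizer,
    )
    trainer.train()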
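The vocabulary-expansion idea mentioned in the summary can be sketched the same way: domain terms are added to the tokenizer and the model's embedding matrix is resized to match, after which the model would be further pre-trained on in-domain legal text. The added terms here are hypothetical placeholders, not vocabulary taken from the thesis.

    # Vocabulary-expansion sketch, assuming Hugging Face Transformers.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = "nlpaueb/bert-base-greek-uncased-v1"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    # Hypothetical legal-domain terms; new embedding rows start random,
    # so continued pre-training on legal text is needed to make them useful.
    tokenizer.add_tokens(["placeholder_legal_term_1", "placeholder_legal_term_2"])
    model.resize_token_embeddings(len(tokenizer))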
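For reference, F1 is the harmonic mean of the two other reported metrics, F1 = 2PR / (P + R). The toy snippet below computes all three with scikit-learn; the library and the micro-averaging choice are conveniences of this example, not necessarily what the thesis used.

    # Toy evaluation sketch for the metrics named in the summary.
    from sklearn.metrics import precision_recall_fscore_support

    y_true = [0, 2, 1, 2, 0]   # toy gold labels
    y_pred = [0, 1, 1, 2, 0]   # toy model predictions

    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro")  # aggregate over all classes
    print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")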
Main subject category:
Technology - Computer science
Keywords:
BERT, Neural Networks, Natural Language Processing, Legal Documents
Index:
Yes
Number of index pages:
3
Contains images:
Yes
Number of references:
27
Number of pages:
37
Bert_thesis.pdf (996 KB)