Comp-BERT-ition: Which BERT model is better for Greek legal text classification?

Graduate Thesis uoadl:2960898

Unit:
Department of Informatics and Telecommunications
Deposit date:
2021-09-16
Year:
2021
Author:
VAMVOURELLIS EFSTRATIOS
Supervisors info:
Manolis Koubarakis, Professor, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications
Despina-Athanasia Pantazi, PhD Candidate, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications
Original Title:
Comp-BERT-ition: Which BERT model is better for Greek legal text classification?
Languages:
English
Translated title:
Comp-BERT-ition: Which BERT model is better for Greek legal text classification?
Summary:
Deep Neural Networks (DNNs) form a very active subfield of Artificial Intelligence (AI), and many experts believe they may shape the future of Computer Science. Natural Language Processing (NLP) is the area of AI and linguistics concerned with the interactions between computers and human language, in particular how to program computers to process and analyze natural language data. With the creation of BERT [5], a large DNN tasked with understanding the English language, in 2018, and its integration into the Google search algorithm, the field of NLP took a big leap forward. Since then, only a few models have managed to surpass BERT, and only by small margins; until recently (2020), BERT and its variants were considered state of the art. The current thesis examines different variations of the BERT model, trained on different datasets, and their ability to classify Greek legal documents. It also discusses ways to further improve our fine-tuned models for legal-domain tasks, such as domain-specific adaptation and vocabulary expansion. We use the RAPTARCHIS [3] dataset, which provides Greek legal documents for three classification tasks. Our fine-tuned Greek-only models have very similar performance, while the multilingual model falls behind. We conclude that with domain- and task-adaptive pre-training, performance would surely improve across all models. We also hypothesize that, based on known heuristics described in chapter 5, the multilingual model could surpass the others. The metrics we use are precision (P), recall (R) and F1 score; we chose these metrics to allow a direct comparison with the prior models evaluated on the same dataset.
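To make the fine-tuning setup concrete, the sketch below shows how one of the compared checkpoints could be fine-tuned for such a classification task with the Hugging Face Transformers library. This is an illustrative sketch, not the thesis code: the model names are real Hub checkpoints for Greek and multilingual BERT, but the dataset identifier greek_legal_code, its "volume" configuration, and the training hyperparameters are assumptions standing in for one of the three RAPTARCHIS tasks.

    # Minimal fine-tuning sketch (assumed setup, not the thesis code).
    from transformers import (AutoTokenizer,
                              AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    MODEL = "nlpaueb/bert-base-greek-uncased-v1"   # one of the Greek-only models
    # MODEL = "bert-base-multilingual-uncased"     # the multilingual baseline

    # Assumed Hub id/config for one RAPTARCHIS classification task.
    dataset = load_dataset("greek_legal_code", "volume")
    num_labels = dataset["train"].features["label"].num_classes

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL, num_labels=num_labels)

    def tokenize(batch):
        # Legal documents are long; BERT truncates at 512 subword tokens.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    encoded = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],  # assumes a validation split exists
        tokenizer=tokenizer,
    )
    trainer.train()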
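The vocabulary-expansion idea mentioned in the summary can be sketched the same way: domain terms are added to the tokenizer and the model's embedding matrix is resized to match, after which the model would be further pre-trained on in-domain legal text. The added terms here are hypothetical placeholders, not vocabulary taken from the thesis.

    # Vocabulary-expansion sketch, assuming Hugging Face Transformers.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = "nlpaueb/bert-base-greek-uncased-v1"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    # Hypothetical legal-domain terms; new embedding rows start random,
    # so continued pre-training on legal text is needed to make them useful.
    tokenizer.add_tokens(["placeholder_legal_term_1", "placeholder_legal_term_2"])
    model.resize_token_embeddings(len(tokenizer))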
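For reference, F1 is the harmonic mean of the two other reported metrics, F1 = 2PR / (P + R). The toy snippet below computes all three with scikit-learn; the library and the micro-averaging choice are conveniences of this example, not necessarily what the thesis used.

    # Toy evaluation sketch for the metrics named in the summary.
    from sklearn.metrics import precision_recall_fscore_support

    y_true = [0, 2, 1, 2, 0]   # toy gold labels
    y_pred = [0, 1, 1, 2, 0]   # toy model predictions

    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro")  # aggregate over all classes
    print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")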
Main subject category:
Technology - Computer science
Keywords:
BERT, Neural Networks, Natural Language Processing, Legal Documents
Index:
Yes
Number of index pages:
3
Contains images:
Yes
Number of references:
27
Number of pages:
37
Bert_thesis.pdf (996 KB)