Language Models for Ancient Greek

Graduate Thesis uoadl:3100154

Unit:
Department of Informatics and Telecommunications
Informatics
Deposit date:
2022-03-27
Year:
2022
Author:
SPANOPOULOS ANDREAS-THEOLOGOS
Supervisors info:
Manolis Koubarakis, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
Language Models for Ancient Greek
Languages:
English
Translated title:
Language Models for Ancient Greek
Summary:
BERT is a pre-trained language model introduced by Google AI Language in 2018 that achieves state-of-the-art results on many downstream tasks. It is a revolutionary model that can be used to tackle almost any NLP task, and it has also been successfully applied to languages other than English, such as French, Spanish, and even Greek and Latin.

The goal of this thesis was to create a language model for Ancient Greek based on a BERT-like architecture and then fine-tune it on a downstream task, specifically Part-of-Speech (PoS) tagging. Developing the model required two main steps: first collecting the data, and then training a language model on the data acquired.

Regarding the data collection step, considerable research went into finding publicly available sources of plain-text Ancient Greek, but not much was found. In total, we were able to collect slightly less than 450 MB of data. This was a major limitation, as most BERT-like models have been trained on corpora on the scale of 30-50 GB.

Regarding the training step, the RoBERTa model and pre-training procedure were chosen as the framework for the Ancient Greek language model, since RoBERTa had demonstrated better performance than other BERT variants. The training objective was Masked Language Modelling (MLM) with dynamic masking, in which the positions to mask are re-sampled every time a sequence is fed to the model, rather than being fixed once during preprocessing.
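
As an illustration (not the exact training script of the thesis; all paths, file names, and hyperparameters below are hypothetical), such a pre-training run could be set up with the Hugging Face Transformers library, where DataCollatorForLanguageModeling performs the dynamic masking by re-sampling the masked tokens for every batch:

    from datasets import load_dataset
    from transformers import (
        DataCollatorForLanguageModeling,
        RobertaConfig,
        RobertaForMaskedLM,
        RobertaTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    # Hypothetical paths: a tokenizer trained on the collected corpus,
    # and the corpus itself as one plain-text file.
    tokenizer = RobertaTokenizerFast.from_pretrained("./ancient-greek-tokenizer")

    config = RobertaConfig(
        vocab_size=len(tokenizer),
        max_position_embeddings=514,   # 512 tokens plus the offset RoBERTa uses
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
    )
    model = RobertaForMaskedLM(config)

    dataset = load_dataset("text", data_files={"train": "ancient_greek_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # Dynamic masking: the collator picks a fresh random 15% of tokens to mask
    # each time a batch is assembled, instead of masking the corpus once.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./roberta-ancient-greek",
            per_device_train_batch_size=8,
            num_train_epochs=3,
        ),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()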

Unexpectedly, the results were two-fold. On the one hand, the language model kept underfitting because it was not seeing enough data. Many ideas were tried, such as reducing the model size and tuning the hyperparameters with Bayesian optimization, but none yielded good results. On the other hand, when fine-tuning for PoS tagging, the results were reasonably good, which suggests that the language model has learnt important aspects of the Ancient Greek language.
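
Fine-tuning for PoS tagging treats the task as token classification on top of the pre-trained encoder. A minimal sketch, assuming a hypothetical tag set and a word-aligned labelled dataset (not shown here):

    from transformers import RobertaForTokenClassification, RobertaTokenizerFast

    # Hypothetical tag set; the actual labels depend on the annotated corpus used.
    pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP", "CONJ", "PART", "PUNCT"]
    label2id = {tag: i for i, tag in enumerate(pos_tags)}

    tokenizer = RobertaTokenizerFast.from_pretrained("./roberta-ancient-greek")
    model = RobertaForTokenClassification.from_pretrained(
        "./roberta-ancient-greek",
        num_labels=len(pos_tags),
        id2label={i: tag for tag, i in label2id.items()},
        label2id=label2id,
    )

    # Each training example pairs sub-word input_ids with one label per token;
    # labels of special and padding tokens are set to -100 so the loss ignores
    # them, and training then proceeds with the same Trainer API as above.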

The training curves show that the model is definitely learning something, as the loss keeps decreasing up to the point where it converges. We strongly believe that this underfitting is due to the lack of a much larger corpus. If more data becomes available in the future, it would definitely be worth trying this approach again. For this reason, the code for downloading the data and training a model is made available at
https://github.com/AndrewSpano/BSc-Thesis.
Main subject category:
Technology - Computer science
Keywords:
Artificial Intelligence, Machine Learning, Natural Language Processing, Deep Learning, Transformers, BERT
Index:
Yes
Number of index pages:
4
Contains images:
Yes
Number of references:
39
Number of pages:
45
Language-Models-for-Ancient-Greek-Thesis.pdf (2 MB)