Building End-to-End Neural Machine Translation Systems for Crisis Scenarios: The Case of COVID-19

Postgraduate Thesis uoadl:3257620

Specialization: Big Data and Artificial Intelligence
Deposit date:
Roussis Dimitrios
Supervisors info:
Vassilis Katsouros, Research Director, "Athena" Research Center
Vassilis Papavassiliou, Research Associate, "Athena" Research Center
Sokratis Sofianopoulos, Research Associate, "Athena" Research Center
Original Title:
Building End-to-End Neural Machine Translation Systems for Crisis Scenarios: The Case of COVID-19
Translated title:
Building End-to-End Neural Machine Translation Systems for Crisis Scenarios: The Case of COVID-19
Machine Translation is a central task in Natural Language Processing, as it aims to provide a fast and automatic way of translating various types of text. In recent years, the emergence of Neural Machine Translation and the compilation of large-scale parallel corpora have led to significant improvements in translation quality. However, a translation model is not necessarily suited to every domain, which has motivated substantial research on domain adaptation of Neural Machine Translation systems, i.e., on how best to improve the translation quality of an existing system for a specific topic or genre.

Crisis Machine Translation is a special case of Domain Adaptation concerned with the rapid adaptation of an existing Machine Translation system to a crisis scenario, since integrating such a system into a rapid response infrastructure can accelerate decision making and relief provision. The COVID-19 pandemic proved to be a prolonged, global crisis with large gaps in transparent, timely, and effective communication; it was also marked by misinformation, conspiracy theories, and significant restrictions on press freedom. Further research on Crisis Machine Translation could play an important role in responding better to similar crises in the future.

In this thesis, we focus on the case of the COVID-19 pandemic and the English-Greek translation direction. We also create two domain-specific multilingual parallel corpora: one related to COVID-19 and one gathered from the abstracts of academic theses and dissertations.

First, we describe the methodologies for acquiring new domain-specific parallel corpora and for generating synthetic data, which are combined with existing parallel data in order to adapt an existing system to the COVID-19 domain. This process includes data filtering, pre-processing, and selection pipelines, which are also described in detail.
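To make the data filtering step more concrete, the following is a minimal sketch of heuristic filtering for sentence pairs. It is an illustration only and does not reproduce the pipeline used in the thesis: the function names `normalize` and `filter_pairs`, the token limits, and the length-ratio threshold are all assumptions chosen for the example.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def filter_pairs(pairs, max_tokens=200, max_ratio=2.5):
    """Yield cleaned (source, target) pairs that pass simple heuristics.

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs with an empty side.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop overly long segments.
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose token counts differ too much (likely misaligned).
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop pairs where the target merely copies the source.
        if src.casefold() == tgt.casefold():
            continue
        # Drop duplicates, compared case-insensitively after normalization.
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt
```

A call such as `list(filter_pairs(candidate_pairs))` returns the surviving pairs in their original order. Real pipelines typically add further checks, such as language identification, that are omitted here.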
Afterwards, we conduct experiments on different fine-tuning strategies for a simulated crisis scenario in which varying amounts of related data become available as time progresses. We are also concerned with the phenomenon of “catastrophic forgetting”, i.e., the degradation of system performance on general texts.
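One simple way to quantify the trade-off between in-domain gains and catastrophic forgetting is to compare the scores of the base and fine-tuned systems on each test set. The sketch below assumes that metric scores (for example, BLEU) have already been computed elsewhere; the function name `forgetting_report` and its arguments are hypothetical and are not taken from the thesis.

```python
def forgetting_report(base_scores, tuned_scores, general_sets):
    """Return per-test-set score changes and the mean change on general sets.

    base_scores, tuned_scores: dicts mapping test-set name -> metric score.
    general_sets: names of the test sets treated as general-domain.
    A negative mean change on the general sets indicates forgetting.
    """
    # Change in score for every test set scored by the base system.
    deltas = {name: tuned_scores[name] - base_scores[name] for name in base_scores}
    # Average the changes over the general-domain test sets only.
    general = [deltas[name] for name in general_sets]
    mean_general = sum(general) / len(general) if general else 0.0
    return deltas, mean_general
```

Running this after each fine-tuning stage of a simulated scenario would show whether improvements on crisis-related test sets come at the cost of lower scores on general texts.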

Lastly, we construct an end-to-end Neural Machine Translation system specialized in translating COVID-19-related English texts into Greek. To assess its performance across different domains and determine its strengths and weaknesses, we conduct an extended evaluation on eight test sets (half of which were created specifically for this thesis), comparing it against other publicly available models and commercial translation services.
Main subject category:
Technology - Computer science
Keywords:
COVID-19, Domain Adaptation, Crisis Machine Translation, Multilingual Corpora Acquisition, Transformers
Number of index pages:
Contains images:
Number of references:
Number of pages:
dsit_roussis_master_thesis_final.pdf (1 MB)