GreekQA: A Crowdsourcing Platform and its Use for Creating a Greek Question Answering Dataset

Graduate Thesis uoadl:3237137

Unit:
Department of Informatics and Telecommunications
Informatics
Deposit date:
2022-11-10
Year:
2022
Author:
SIATRAS EFSTATHIOS
Supervisors info:
Manolis Koubarakis, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
GreekQA: A Crowdsourcing Platform and its Use for Creating a Greek Question Answering Dataset
Languages:
English
Translated title:
GreekQA: A Crowdsourcing Platform and its Use for Creating a Greek Question Answering Dataset
Summary:
Teaching machines to comprehend, process, and produce human language has been a persistent challenge since the first decades of electronic digital programmable computers. Today, progress in Natural Language Processing is present in everyday life and offers people an expanding set of conveniences. The field has flourished once again with the recent arrival of increasingly sophisticated and flexible language models. These state-of-the-art models have tackled a plethora of Natural Language Processing tasks, bringing significant performance gains. Machine reading comprehension has been one of the cornerstone tasks that benefited from these recent advances. This challenging task requires machines to read a passage of text and answer questions based on the context. Besides the structure of the models, reading comprehension datasets have also played a decisive role in achieving successful results. Motivated by this trend in the reading comprehension task, an increasing number of question answering datasets have appeared in English and in a specific group of other languages. For the Greek language, however, there has been no progress on native question answering datasets; the only ones available are automatically translated from other languages.

In light of the above, we present the Greek Question Answering (GreekQA) dataset, a Greek reading comprehension dataset based on Wikipedia articles. The GreekQA1.0 dataset consists of 1,000+ questions posed by crowdworkers on curated passages from a set of Wikipedia articles in Greek. For the development of the GreekQA dataset, we also introduce the GreekQA Crowdsourcing Annotation Platform of the same name, a web application specifically designed and implemented for crowdsourcing the collection of question and answer pairs for this dataset. We analyze the requirements and the selected technologies of the GreekQA crowdsourcing platform, describe its design, and present its implementation. We describe the procedure for curating passages and the guidelines defined for collecting question and answer pairs. To understand the properties of GreekQA1.0, we analyze the questions and answers as well as the reasoning required to answer the questions based on the corresponding passage. Finally, we evaluate human performance as a baseline for future experimental evaluation of language models using this dataset.
Main subject category:
Technology - Computer science
Keywords:
Machine Reading Comprehension, Question Answering, Dataset Collection, Crowdsourcing Platform
Index:
Yes
Number of index pages:
4
Contains images:
Yes
Number of references:
30
Number of pages:
50
file.pdf (1 MB)