Creation of a Dataset with utterances containing multiple intents including the linguistic phenomena of anaphora, cataphora & ellipsis

Postgraduate Thesis uoadl:3370694

Unit:
Specialty Language Technology
Informatics
Deposit date:
2023-12-05
Year:
2023
Author:
Archonti Vaia-Stavroula
Supervisors info:
Themos Stafylakis, Elected Associate Professor, Athens University of Economics and Business (AUEB)
Original Title:
Creation of a Dataset with utterances containing multiple intents including the linguistic phenomena of anaphora, cataphora & ellipsis
Languages:
English
Greek
Translated title:
Creation of a Dataset with utterances containing multiple intents including the linguistic phenomena of anaphora, cataphora & ellipsis
Summary:
Within task-oriented dialogue (TOD) systems, intents are typically regarded as the fundamental units of recognition. In real-world applications, user utterances frequently contain multiple intents, an aspect that most NLU datasets ignore. Recent efforts to create such datasets often still follow the current trend of building datasets from single-intent utterances and tend to focus solely on the simple case of concatenating two single-intent utterances with a conjunction. However, in real conversations, the two utterances may have the same referents or share common verbs and nouns, resulting in anaphoric or cataphoric constructions in the first case and elliptical constructions in the second. The primary objective of this thesis is to create a dataset of multi-intent utterances that incorporate the linguistic phenomena of anaphora, cataphora, and ellipsis. These utterances were built from the pre-existing CLINC150 dataset, and the English-GUM corpus was used for constructing the anaphoric, cataphoric, and elliptical structures. Because these linguistic phenomena are complex, the dataset had to be created manually. The annotation was carried out by Canadian native speakers of English who volunteered as annotators during the dataset evaluation. Finally, two baseline experiments were carried out on the dataset: a multi-label learning technique that treats each double intent as an atomic label, and a threshold-based multi-label approach that predicts single or double intents using only single-intent labels. The first approach performed better than the threshold-based approach, which yielded less satisfactory results. Nevertheless, using only single-intent labels to predict both single and double intents could be the more effective strategy, especially since it does not depend on double intents being present in the training set.
Main subject category:
Technology - Computer science
Keywords:
multi-intent dataset, multi-intent classification, dialogue systems, cataphora, anaphora, ellipsis
Index:
Yes
Number of index pages:
3
Contains images:
Yes
Number of references:
44
Number of pages:
54
Thesis_Archonti_Vana__.pdf (947 KB)