Summary:
In task-oriented dialogue (TOD) systems, intents are typically regarded as the fundamental units of recognition. In real-world applications, user utterances frequently contain multiple intents, a fact that most NLU datasets ignore. Recent efforts to create multi-intent datasets still largely follow the prevailing single-intent paradigm and tend to focus solely on the simple case of concatenating two single-intent utterances with a conjunction. In real conversations, however, the two utterances may have the same referents or share common verbs and nouns, resulting in anaphoric or cataphoric constructions in the former case and elliptical constructions in the latter. The primary objective of this thesis is to create a dataset of multi-intent utterances that incorporate the linguistic phenomena of anaphora, cataphora and ellipsis. The utterances were derived from the pre-existing CLINC150 dataset, and the English-GUM corpus was employed for the construction of the anaphoric, cataphoric and elliptical structures. Because of the complexity of these linguistic phenomena, the dataset had to be created manually. The annotation carried out during the dataset evaluation was performed by Canadian native speakers of English who volunteered as annotators. Finally, two baseline experiments were carried out on the dataset: a multi-label learning technique that treats each double intent as an atomic label, and a threshold-based multi-label approach that predicts single or double intents based only on single intents. The experimental results indicate that the first approach outperformed the threshold-based approach, which yielded less satisfactory results.
Nevertheless, predicting both single and double intents from single-intent labels alone could still prove the more effective strategy, particularly because it does not depend on double-intent examples in the training set.
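The difference between the two baselines can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names, the intent names, the score values, the 0.5 threshold and the "+" separator are all assumptions made for illustration. The first function shows how a double intent can be turned into a single atomic label; the second shows how a threshold over per-intent scores can yield either one or two intents without any double-intent label existing in the label set.

```python
from typing import Dict, List


def atomic_label(intents: List[str]) -> str:
    """Atomic-label baseline (illustrative): join the intents of an utterance
    into one label, so a double intent becomes a single class."""
    # Sorting makes the label independent of the order of the intents.
    return "+".join(sorted(intents))


def predict_with_threshold(
    scores: Dict[str, float],
    threshold: float = 0.5,  # assumed value, not taken from the thesis
    max_intents: int = 2,
) -> List[str]:
    """Threshold baseline (illustrative): keep every single intent whose score
    reaches the threshold, capped at max_intents, with the top-scoring intent
    as a fallback when no score reaches the threshold."""
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    selected = [label for label in ranked if scores[label] >= threshold]
    selected = selected[:max_intents]
    return selected if selected else ranked[:1]


if __name__ == "__main__":
    # Hypothetical per-intent scores from a single-intent classifier.
    example_scores = {"book_flight": 0.91, "weather": 0.72, "transfer": 0.10}
    print(atomic_label(["weather", "book_flight"]))
    print(predict_with_threshold(example_scores))
```

In this sketch the atomic-label variant can only predict intent pairs that occur in the training data, whereas the threshold variant can in principle combine any two single intents.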
Keywords:
multi-intent dataset, multi-intent classification, dialogue systems, cataphora, anaphora, ellipsis