Greek corpora analysis using Machine Learning techniques: Computational representation of idiolect.

Περήφανος Κωνσταντίνος

Unit:

Department of Philology
Library of the School of Philosophy

Deposit date:

2019-03-28

Year:

2019

Author:

Perifanos Konstantinos

Dissertation committee:

Διονύσιος Γούτσος, Professor of Linguistics, Department of Philology, School of Philosophy, NKUA
Γεώργιος Μικρός, Professor of Computational Linguistics, Department of Italian Language and Literature, School of Philosophy, NKUA
Γεώργιος Μαρκόπουλος, Associate Professor of Linguistics, Department of Philology, School of Philosophy, NKUA
Σπυριδούλα Μπέλλα, Professor of Pragmatics, Department of Philology, School of Philosophy, NKUA
Σταματία Κουτσουλέλου, Associate Professor of Linguistics, Department of Philology, School of Philosophy, NKUA
Θεμιστοκλής Παναγιωτόπουλος, Professor of Artificial Intelligence, Department of Informatics, University of Piraeus
Άγγελος Πικράκης, Assistant Professor of Machine Learning, Department of Informatics, University of Piraeus

Original Title:

Ανάλυση ελληνικών σωμάτων κειμένων με τη χρήση τεχνικών μηχανικής μάθησης: Υπολογιστική αναπαράσταση της ιδιολέκτου.

Languages:

Greek

Translated title:

Greek corpora analysis using Machine Learning techniques: Computational representation of idiolect.

Summary:

Idiolect, as a term in linguistics, refers to the unique and distinctive use of language by an individual and is the individual counterpart of sociolect. Research on idiolect has so far been rather neglected in sociolinguistics, especially as regards its empirical validation. Research on idiolect in corpus linguistics and stylometry has also been limited in terms of either the number of authors examined (typically fewer than 10) or the number of vocabulary items used to examine idiolectal similarity (up to ~310 function words). This thesis employs learned distributed representations, or lexical embeddings, to analyse texts by social media users that are considered to reflect their writing style. The data include a Twitter corpus of Greek texts posted by 4,494 users from 2009 to 2016 (approximately 325 million words) and the Blog Authorship Corpus, used for comparison. Based on Zellig Harris' Distributional Hypothesis, according to which semantically similar words tend to appear in the same contexts, lexical (or word) embeddings can be used to address the question of idiolect, thus providing a stylistic fingerprint for the authors involved. The performance of various models of distributed representation is explored and compared; in particular, these involve lexical embeddings produced by neural probabilistic language models (namely word2vec, fastText and doc2vec) and by matrix factorization (namely GloVe). The selected models are applied to the entire vocabulary of the texts concerned; they are therefore not limited by corpus vocabulary size and scale to thousands of authors. It is found that idiolect embeddings a) can be used to represent the style of individual authors and b) can provide the means of clustering users in terms of their idiolectal similarity, revealing clusters of the same style, as well as the means of quantifying idiolect stability over time.
The findings have considerable applications in areas such as authorship attribution, plagiarism detection, and the detection of online harassment and abuse. Furthermore, this is the first extended study of idiolect in Greek texts using machine learning methods, which suggests that lexical embeddings can be fruitfully employed in further areas of research on this language.
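
As a rough illustration of comparing authors through embeddings, the sketch below averages word vectors into one vector per author and compares authors by cosine similarity. It is a minimal sketch under stated assumptions, not the method of the thesis: the three-dimensional word vectors, the user names and the token lists are invented toy values, and averaging is only one simple way to derive an author-level vector (the summary also mentions doc2vec, which learns such vectors directly).

```python
import math

# Hypothetical toy word vectors. In practice these would be learned from a
# corpus by a model such as word2vec, fastText or GloVe.
TOY_WORD_VECTORS = {
    "lol": [0.9, 0.1, 0.0],
    "haha": [0.8, 0.2, 0.1],
    "therefore": [0.0, 0.9, 0.3],
    "moreover": [0.1, 0.8, 0.4],
}


def author_embedding(tokens, word_vectors):
    """Average the vectors of all known tokens written by one author."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        raise ValueError("none of the author's tokens has a word vector")
    dimensions = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dimensions)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical authors, each reduced to a list of tokens from their posts.
authors = {
    "user_a": ["lol", "haha", "lol"],
    "user_b": ["haha", "lol"],
    "user_c": ["therefore", "moreover"],
}

embeddings = {
    name: author_embedding(tokens, TOY_WORD_VECTORS)
    for name, tokens in authors.items()
}

# Pairwise similarities; authors with similar vocabularies score higher.
for first, second in [("user_a", "user_b"), ("user_a", "user_c")]:
    score = cosine_similarity(embeddings[first], embeddings[second])
    print(f"{first} vs {second}: {score:.3f}")
```

With these toy values, user_a and user_b end up closer to each other than either is to user_c. A pairwise similarity matrix of this kind is the sort of input a clustering step could take.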

Main subject category:

Technology - Computer science

Keywords:

Corpora, Idiolect, Machine Learning, Neural Networks, Word embeddings

Index:

Yes

Number of index pages:

Contains images:

Yes

Number of references:

199

Number of pages: