Hate Speech Detection using different text representations in online user comments

Postgraduate Thesis uoadl:2800879

Unit:
Track / specialization: Information and Communication Technologies (ICT)
Informatics
Deposit date:
2018-10-05
Year:
2018
Author:
Themeli Chrysoula
Supervisors info:
Panagiotis Stamatopoulos, Assistant Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
Hate Speech Detection using different text representations in online user comments
Languages:
English
Translated title:
Hate Speech Detection using different text representations in online user comments
Summary:
Hate Speech is abusive or stereotyping speech against a group of people, based on characteristics
such as race, religion, sexual orientation and gender. It is illegal under
current legislation in the USA and the EU; however, the Internet and social media have made it
possible to spread hatred easily, fast and anonymously. The large scale of data produced
through social media platforms requires the development of an effective automatic model
to detect such content. We study the performance of several text representation techniques
and classification algorithms, aiming to efficiently handle the online abusive language
discrimination task. We examine various representation techniques such as Bag
of Words (BoW), word and character Bag of n-grams, sentiment, syntax and grammar
analysis, word embeddings and n-gram graphs. In addition, we test multiple classification
algorithms: Naive Bayes, Logistic Regression, Random Forests, K-Nearest Neighbors
and Artificial Neural Networks. Our goal is to evaluate representation and classification
algorithms with respect to their contribution to performance in the Hate Speech detection
task. Moreover, we highlight the utility of n-gram graphs (NGGs) as an efficient, low-dimensional
text representation that constructs similarity vectors which appear to constitute
deep features with significant contribution to the classification results. Apart from the
binary classification experiments, we additionally test our method in multi-class classification
experiments on abusive language discrimination tasks. Our results show that NGGs
are informative and rich features: despite being represented by vectors with dimensions
equal to the number of possible classes, they perform slightly worse than the Bag of Words
and word embeddings, which in contrast are high-dimensional representations.
We furthermore run statistical tests to examine whether NGGs contribute
significantly to the results. The tests not only show that NGGs are significant features
with respect to the classification result, but also that the combination of the three best
performing features (BoW, NGGs and word embeddings) achieves the best classification
performance, with the use of the remaining text representations yielding deteriorated
results. Finally, the classification algorithm selection seems to be less important, since
statistical results for all the tested algorithms are similar.
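
The summary describes NGGs as yielding one similarity value per class, so a comment is represented by a vector whose length equals the number of classes. The following is a minimal sketch of that idea, not the thesis implementation: the function names, the character 3-gram and window settings, the merge-by-summing step and the simplified value-similarity formula are all assumptions made for illustration, and the example comments are invented toy data.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected weighted edges.

    Nodes are character n-grams; an edge links two n-grams that start within
    `window` positions of each other, weighted by how often that happens.
    (n and window are illustrative defaults, not values from the thesis.)
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return edges


def merge_graphs(graphs):
    """Combine per-document graphs into one class graph by summing edge weights.

    Assumption: summing is a simplification chosen for this sketch.
    """
    merged = Counter()
    for graph in graphs:
        merged.update(graph)
    return merged


def value_similarity(g1, g2):
    """Simplified value similarity: shared edges scored by their weight ratio,
    normalised by the size of the larger graph. Result lies in [0, 1]."""
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))


def similarity_vector(text, class_graphs):
    """Represent a text as its similarity to each class graph.

    The vector has one entry per class, ordered by sorted class label.
    """
    doc_graph = ngram_graph(text)
    return [value_similarity(doc_graph, class_graphs[label])
            for label in sorted(class_graphs)]


if __name__ == "__main__":
    # Invented toy training comments, one list per (hypothetical) class label.
    training = {
        "abusive": ["you are awful", "awful people like you"],
        "clean": ["have a nice day", "what a nice idea"],
    }
    class_graphs = {label: merge_graphs(ngram_graph(t) for t in texts)
                    for label, texts in training.items()}
    print(similarity_vector("such a nice day", class_graphs))
```

A vector like this could then be passed to any of the classifiers listed in the summary, either on its own or concatenated with other representations.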
Main subject category:
Technology - Computer science
Keywords:
natural language processing, machine learning, hate speech, classification
Index:
Yes
Number of index pages:
5
Contains images:
Yes
Number of references:
34
Number of pages:
87
Thesis.pdf (802 KB)