Supervisors info:
Panagiotis Stamatopoulos (Παναγιώτης Σταματόπουλος), Assistant Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Summary:
Hate Speech is abusive or stereotyping speech against a group of people, based on characteristics
such as race, religion, sexual orientation and gender. It is illegal under current
legislation in the USA and the EU; however, the Internet and social media have made it
possible to spread hatred easily, quickly and anonymously. The large scale of data produced
through social media platforms requires the development of an effective automatic model
to detect such content. We study the performance of several text representation techniques
and classification algorithms, aiming to efficiently handle the online abusive language
discrimination task. We examine various representation techniques such as Bag
of Words (BoW), word and character Bag of n-grams, sentiment, syntax and grammar
analysis, word embeddings and n-gram graphs. In addition, we test multiple classification
algorithms: Naive Bayes, Logistic Regression, Random Forests, K-Nearest Neighbors
and Artificial Neural Networks. Our goal is to evaluate representation and classification
algorithms with respect to their contribution to performance in the Hate Speech detection
task. Moreover, we highlight the utility of n-gram graphs (NGGs) as an efficient, low-dimensional
text representation that constructs similarity vectors which appear to constitute
deep features with significant contribution to the classification results. Apart from the
binary classification experiments, we additionally test our method in multi-class classification
experiments on abusive language discrimination tasks. Our results show that NGGs
are informative and rich features that, despite being represented by vectors with dimensionality
equal to the number of possible classes, perform only slightly worse than Bag of Words
and word embeddings, which, in contrast, are high-dimensional representations.
We furthermore execute statistical tests to examine whether NGGs contribute significantly
to the results. The tests show not only that NGGs are significant features
with respect to the classification result, but also that the combination of the three best-performing
features (BoW, NGGs and word embeddings) achieves the best classification
performance, while adding the remaining text representations deteriorates the
results. Finally, the choice of classification algorithm appears to be less important, since
the statistical results for all tested algorithms are similar.
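As an illustration of the kind of representation/classifier pairing evaluated above, the following minimal sketch combines a Bag of Words (word n-gram) representation with Logistic Regression. The use of scikit-learn, the toy corpus and its labels are assumptions for illustration only, not the thesis's actual experimental setup or data.

```python
# Minimal sketch: Bag of Words features + Logistic Regression, one of the
# representation/classifier pairs described in the summary.
# Assumption: scikit-learn; the toy corpus below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (1 = abusive, 0 = clean); real experiments
# would use an annotated hate-speech dataset.
texts = [
    "those people are worthless and should leave",
    "i hate that group, they ruin everything",
    "what a lovely day for a walk",
    "great talk, thanks for sharing",
]
labels = [1, 1, 0, 0]

# CountVectorizer builds the BoW / word n-gram representation;
# Logistic Regression is one of the tested classification algorithms.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# Predict a label for an unseen sentence (output is a length-1 array).
pred = model.predict(["they ruin everything around here"])
```

The same pipeline structure accommodates the other representations tested (e.g. character n-grams via `CountVectorizer(analyzer="char")`) by swapping the first stage.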
Keywords:
natural language processing, machine learning, hate speech, classification