Pergamos - Library and Information Center of National and Kapodistrian University of Athens

Unit:

Department of Informatics and Telecommunications
Informatics

Deposit date:

2022-03-09

Year:

2022

Author:

KAMPILI PANAGIOTA

Supervisors info:

Εμμανουήλ Κουμπαράκης, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

Original Title:

Large-Scale Multi-label Classification of Greek legislation

Languages:

English

Translated title:

Large-Scale Multi-label Classification of Greek legislation

Summary:

Natural Language Processing is an area in Artificial Intelligence that is constantly attracting scientific interest and facilitates everyday tasks.
We focus on a specific case of the multi-label classification problem, one that becomes more and more frequent as the volume of data keeps growing. Large-scale Multi-label Text Classification (LMTC) is characterized by a large label space, typically organized hierarchically, and by unbalanced label distributions. Our area of interest is the legal domain, and we chose to experiment with the Greek language, more specifically with "RAPTARCHIS47k", a dataset consisting of more than forty-seven thousand Greek legal documents. The objectives of this thesis are the hands-on evaluation of multi-label approaches on Greek legal documents, the comparison of LMTC-dedicated techniques with general state-of-the-art methods, and experimentation with learning to predict labels that rarely occur in the training set. We focus on some of the most well-known and promising hierarchical Probabilistic Label Tree (PLT) methods and hybrid PLT-neural network methods, and we further experiment with transfer learning using the latest transformer-based approaches. We evaluate these methods on three different levels of label frequency (all-labels, frequent, few-case), and we investigate a multitude of configurations for every method separately. Our experiments showed that there is no rule of thumb about which method should be used, as no single approach gave the best performance in all three sub-tasks. Transformer-based models gave the best performance in the sub-tasks where common labels dominate the hierarchy, while PLTs proved superior on the task involving tail labels. To the best of our knowledge, LMTC is vastly understudied, especially for the Greek language, and we hope that this study will be a reference point for future research.

Main subject category:

Technology - Computer science

Keywords:

Legal Documents, Multi-label Classification, Probabilistic Label Trees, Neural Networks

Index:

Yes

Contains images:

Yes
