clustering and classification in high dimensional sparse data

Graduate Thesis uoadl:1324495

Unit:
Division of Theoretical Computer Science
Library of the School of Science
Deposit date:
2016-03-19
Year:
2016
Author:
Πατσουράκος Κωνσταντίνος
Μπορεκτσίογλου Ιωάννης
Supervisors info:
Γιάννης Ζ. Εμίρης
Original Title:
clustering and classification in high dimensional sparse data
Languages:
English
Translated title:
ομαδοποίηση και κατηγοριοποίηση σε πολυδιάστατα αραιά δεδομένα
Summary:
The main goal of this dissertation can be summarized as the classification of
real high-dimensional sparse data from the field of homeopathy. To achieve this
goal, various methodologies were gathered from the data mining literature, and
several suitable clustering algorithms were implemented until the results were
judged good and useful by field experts.

The biggest challenge was the absence of ground truth that could guide the
attempts to better understand the problem. For that reason, we had to rely on
internal evaluation and experiment with different scoring functions.
Specifically, to attain the above-mentioned goals, a partitional clustering
algorithm was implemented first. We started with a k-medoids approach using
k-medoids++ initialization, PAM assignment (Leonard Kaufman and Peter J.
Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis) and
CLARANS update (Raymond T. Ng and Jiawei Han, "Efficient and Effective
Clustering Methods for Spatial Data Mining"). Because of the hierarchical
structure of the data, these methods did not give useful results according to
internal evaluation, so a hierarchical algorithm known as connected components
was implemented.
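
As a rough illustration of the connected components idea, the sketch below
links any two records whose distance is at most a threshold and reports each
connected component as one cluster. The Jaccard distance, the threshold value,
the toy records, and all function names are assumptions made for this example;
the summary does not state which distance or parameters the thesis used.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of non-zero feature indices."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def connected_components_clustering(points, threshold):
    """Cluster points joined by chains of pairs with distance <= threshold."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumed O(n^2) pairwise scan; fine for a toy example only.
    for i, j in combinations(range(len(points)), 2):
        if jaccard_distance(points[i], points[j]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sparse records, each stored as its set of non-zero features.
records = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {10, 11}, {10, 11, 12}, {20}]
clusters = connected_components_clustering(records, threshold=0.5)
print(clusters)
```

Storing each record as a set of non-zero feature indices keeps the distance
computation proportional to the number of non-zero entries rather than to the
full dimensionality, which is one common way to handle sparse data.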

Lastly, in order to draw conclusions about the words that appeared in the
data, we implemented a hitting set algorithm. It was important to find the
words that appeared most independently of the others, and we treated the
problem as an instance of the well-known set covering problem.
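
The hitting set step can be sketched with the standard greedy heuristic:
repeatedly pick the word contained in the most records that are not yet hit.
This is only an assumed variant for illustration; the summary does not say
which hitting set algorithm the thesis implemented, and the records, the
tie-breaking rule, and the function name below are hypothetical.

```python
from collections import Counter


def greedy_hitting_set(sets):
    """Greedily choose elements until every non-empty set contains one."""
    remaining = [set(s) for s in sets if s]
    chosen = []
    while remaining:
        # Count in how many not-yet-hit sets each element occurs.
        counts = Counter(w for s in remaining for w in s)
        # Assumed tie-break: highest count first, then alphabetical order.
        best = min(counts, key=lambda w: (-counts[w], w))
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


# Hypothetical records, each given as the set of words it contains.
records = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d"}]
chosen = greedy_hitting_set(records)
print(chosen)
```

Hitting set is the dual formulation of set cover: choosing a word "covers"
every record that contains it, so the same greedy strategy applies to both.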
Keywords:
clustering, classification, big data, sparse, high dimensional
Index:
Yes
Number of index pages:
9, 10, 11
Contains images:
Yes
Number of references:
26
Number of pages:
45
document.pdf (936 KB)