The main goal of this dissertation is the classification of real,
high-dimensional, sparse data from the field of homeopathy.
To achieve this goal, several methodologies from the data mining field were
brought together. Suitable clustering algorithms were implemented and refined
until the results were judged good and useful by field experts.
The biggest challenge was the absence of ground truth to guide our attempts to
understand the problem. For that reason, we had to rely on internal evaluation
and experiment with different scoring functions. Specifically, a partitional
clustering algorithm was implemented first. We started with a k-medoids
approach with k-medoids++ initialization, PAM assignment (Leonard Kaufman and
Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
Analysis) and CLARANS update (Raymond T. Ng and Jiawei Han, "Efficient and
Effective Clustering Methods for Spatial Data Mining"). Because of the
hierarchical structure of the data, these methods did not give useful results
according to internal evaluation, so a hierarchical algorithm known as
connected components was implemented.
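As a rough illustration of the connected-components idea, the sketch below links two records whenever their distance falls below a threshold and reports each connected component of the resulting graph as a cluster. The Jaccard distance, the `threshold` parameter, and the function names are assumptions made for this example, not the settings used in the dissertation.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of active features."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def connected_components(records, threshold):
    """Group record indices by linking pairs closer than `threshold`.

    `threshold` is a hypothetical parameter chosen for illustration.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge two components) for every sufficiently close pair.
    for i, j in combinations(range(len(records)), 2):
        if jaccard_distance(records[i], records[j]) < threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect the members of each component.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Raising the threshold can only merge components, never split them, so a sequence of increasing thresholds yields nested partitions. This is the sense in which such a procedure behaves like single-linkage hierarchical clustering.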
Lastly, in order to draw conclusions about the words that appeared in the
data, we implemented a hitting set algorithm. It was important to find the words
that appeared most independently of the others, and we treated the problem as
the well-known set covering problem.
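To make the hitting set formulation concrete, the following sketch treats each record as a set of words and repeatedly picks the word that occurs in the most records not yet covered. The greedy strategy and the name `greedy_hitting_set` are assumptions for illustration; the text does not state which hitting set algorithm was used, and the greedy heuristic gives an approximate rather than a guaranteed minimum solution.

```python
from collections import Counter


def greedy_hitting_set(records):
    """Choose words until every non-empty record contains a chosen word."""
    remaining = [set(r) for r in records if r]
    chosen = []
    while remaining:
        # Count in how many still-uncovered records each word occurs.
        counts = Counter(word for r in remaining for word in r)
        # Take the most frequent word; break ties alphabetically
        # so that the result is deterministic.
        best = min(counts, key=lambda w: (-counts[w], w))
        chosen.append(best)
        # Drop every record that the chosen word now covers.
        remaining = [r for r in remaining if best not in r]
    return chosen
```

The returned list is ordered by selection, so words chosen early are the ones that cover the largest share of the records.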
clustering, classification, big data, sparse, high dimensional