This thesis concerns the methodology of model-based clustering as an
alternative to classical distance-based clustering techniques. It emphasizes
model-based clustering of high-dimensional data, as well as the use of the
multivariate t distribution in place of the widely used multivariate normal.
Numerous examples and applications are included to illustrate and explain the
methods.
In more detail, before turning to high-dimensional data and the t distribution,
the thesis introduces the philosophy of model-based clustering. It presents the
theory of mixtures of multivariate normal distributions and how those mixtures
are used within the framework of model-based clustering, together with an
extensive account of the EM algorithm for parameter estimation.
Furthermore, the models of the GPCM family are described in detail, and
there is also a discussion of contested issues in model-based clustering, such
as model selection techniques, the appropriate number of groups, and initial
values.
Then, the clustering of high-dimensional data is presented. The problems that
arise when the GPCM family is applied to high-dimensional data are discussed,
and mixtures of factor analyzers are proposed as an alternative. The
implementation of the AECM algorithm, which is used for parameter estimation in
that case, is also described in full.
Furthermore, we present two families of models (PGMM and EPGMM) that are
appropriate for clustering high-dimensional data and are based on mixtures of
multivariate normal distributions; PGMM is nested in EPGMM. Applications of
these families are provided, and their advantages and disadvantages are
discussed.
Then, the multivariate t distribution and the benefits of its use are presented
for clustering both high-dimensional and lower-dimensional data. The AECM and
EM algorithms, respectively, are fully described for these two settings.
Moreover, the MMtFA family of models is presented for t factor analyzers, as
the counterpart of the EPGMM family for normal factor analyzers.
Finally, the PGMM and MMtFA families of models are applied to high-dimensional
data from the gene expression study of van 't Veer et al. The data concern the
expression of 24,182 genes (variables) in 78 women (observations) with breast
cancer. Initially, the UUU (PGMM) and UUC (MMtFA) models are fitted using 100
randomly chosen genes. Then, we move to model-based clustering using all models
of the PGMM family based on 646 genes, which were "appropriately" selected
through a technique similar to EMMIX-GENE (using the normal distribution
instead of the t).
In summary, model-based clustering appears to be a powerful tool for
clustering, especially in the case of high-dimensional data. However, some
problems remain to be overcome, the most basic of which appears to be the
development of an efficient criterion for model selection.
Model-based clustering, High-dimensional data, EPGMM, MMtFA, EMMIX-GENE