A computational pipeline for data augmentation towards the improvement
of disease classification and risk stratification models: A case study
in two clinical domains

Pezoulas, Vasileios C.; Grigoriadis, Grigoris I.; Gkois, George; Tachos, Nikolaos S.; Smole, Tim; Bosnic, Zoran; Piculin, Matej; Olivotto, Iacopo; Barlocco, Fausto; Robnik-Sikonja, Marko; Jakovljevic, Djordje G.; Goules, Andreas; Tzioufas, Athanasios G.; Fotiadis, Dimitrios I.

doi:10.1016/j.compbiomed.2021.104520

Unit:

NKUA (National and Kapodistrian University of Athens) research material

Title:

A computational pipeline for data augmentation towards the improvement
of disease classification and risk stratification models: A case study
in two clinical domains

Document languages:

English

Abstract:

Virtual population generation is an emerging field in data science with
numerous applications in healthcare towards the augmentation of clinical
research databases with significant lack of population size. However,
the impact of data augmentation on the development of AI (artificial
intelligence) models to address clinical unmet needs has not yet been
investigated. In this work, we assess whether the aggregation of real
with virtual patient data can improve the performance of the existing
risk stratification and disease classification models in two rare
clinical domains, namely the primary Sjögren’s Syndrome (pSS)
and the hypertrophic cardiomyopathy (HCM), for the first time in the
literature. To do so, multivariate approaches, such as the multivariate
normal distribution (MVND), and straightforward ones, such as
Bayesian networks, artificial neural networks (ANNs), and tree
ensembles, are compared on their ability to generate
high-quality virtual data. Both boosting and bagging algorithms, such
as gradient boosting trees (XGBoost), AdaBoost, and Random
Forests (RFs), were trained on the augmented data to evaluate the
performance improvement for lymphoma classification and HCM risk
stratification. Our results revealed the favorable performance of the
tree ensemble generators, in both domains, yielding virtual data with
goodness-of-fit 0.021 and KL-divergence 0.029 in pSS and 0.029, 0.027 in
HCM, respectively. The application of the XGBoost on the augmented data
revealed an increase by 10.9% in accuracy, 10.7% in sensitivity,
11.5% in specificity for lymphoma classification and 16.1% in
accuracy, 16.9% in sensitivity, 13.7% in specificity in HCM risk
stratification.
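
As a rough illustration of the workflow the abstract describes, the sketch below fits a multivariate normal distribution (one of the compared generators, not the best-performing one reported) to a toy cohort, samples virtual records, pools them with the real ones, and scores each feature with a KL divergence. Everything here is an assumption made for illustration: the two made-up features, the sample sizes, the helper names (`fit_mvn`, `sample_mvn`, `kl_divergence`), and the histogram-based KL estimate with add-one smoothing. It is not the authors' implementation, and its numbers are not comparable to the reported values.

```python
import math
import random

def fit_mvn(rows):
    """Estimate the mean vector and sample covariance matrix of the rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L L^T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_mvn(mean, cov, size, rng):
    """Draw `size` virtual records as mean + L z, with z standard normal."""
    low = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out

def kl_divergence(p_values, q_values, bins=10):
    """Histogram-based KL(P || Q) for one feature, with add-one smoothing."""
    lo = min(min(p_values), min(q_values))
    hi = max(max(p_values), max(q_values))
    width = (hi - lo) / bins or 1.0

    def hist(values):
        counts = [1.0] * bins  # add-one smoothing avoids log(0)
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1.0
        total = sum(counts)
        return [c / total for c in counts]

    p, q = hist(p_values), hist(q_values)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

rng = random.Random(0)
# Toy "real" cohort: two correlated, made-up features (not clinical data).
real = []
for _ in range(200):
    x = rng.gauss(50.0, 10.0)
    real.append([x, 0.5 * x + rng.gauss(0.0, 3.0)])

mean, cov = fit_mvn(real)
virtual = sample_mvn(mean, cov, 400, rng)
augmented = real + virtual  # real and virtual records pooled for training

kl_per_feature = [kl_divergence([r[j] for r in real], [v[j] for v in virtual])
                  for j in range(2)]
print(len(augmented), kl_per_feature)
```

In a fuller pipeline, a classifier such as XGBoost or a Random Forest would then be trained on `augmented` and evaluated on held-out real records only, so that any gain can be attributed to the augmentation rather than to testing on synthetic data.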

Publication year:

2021

Authors:

Pezoulas, Vasileios C.
Grigoriadis, Grigoris I.
Gkois, George
Tachos, Nikolaos S.
Smole, Tim
Bosnic, Zoran
Piculin, Matej
Olivotto, Iacopo
Barlocco, Fausto
Robnik-Sikonja, Marko
Jakovljevic, Djordje G.
Goules, Andreas
Tzioufas, Athanasios G.
Fotiadis, Dimitrios I.

Journal:

Computers in Biology and Medicine

Publisher:

PERGAMON-ELSEVIER SCIENCE LTD

Volume:

134

Keywords:

Artificial intelligence; Data augmentation; Virtual population
generation; Lymphoma classification; HCM risk stratification

Official URL (Publisher):

https://doi.org/10.1016/j.compbiomed.2021.104520

DOI:

10.1016/j.compbiomed.2021.104520