Classification of COVID-19 patients with the use of machine learning algorithms

Postgraduate Thesis uoadl:3246493

Unit:
Health Informatics Specialization
Library of the School of Health Sciences
Deposit date:
2022-11-24
Year:
2022
Author:
Kourmpanis Nikolaos
Supervisors info:
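Ιωάννης Μαντάς, Professor, Department of Nursing, National and Kapodistrian University of Athens (NKUA)
Ιωσήφ Λίασκος, Laboratory Teaching Staff (Ε.ΔΙ.Π.), Department of Nursing, NKUA
Εμμανουήλ Ζούλιας, Laboratory Teaching Staff (Ε.ΔΙ.Π.), Department of Nursing, NKUA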
Original Title:
Κατηγοριοποίηση ασθενών COVID-19 με τη χρήση αλγορίθμων μηχανικής μάθησης
Languages:
Greek
Translated title:
Classification of COVID-19 patients with the use of machine learning algorithms
Summary:
The novel coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, emerged in late 2019. The virus was first detected in the city of Wuhan, China, in December 2019 and has since spread worldwide as a new pandemic [1], which continues to this day. SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) spreads through the air via respiratory droplets or aerosols produced when an infected person coughs or sneezes. As an RNA virus, SARS-CoV-2 mutates as it replicates, giving rise to several variants, which the World Health Organization (WHO) classifies as Variants of Concern (VOCs) and Variants of Interest (VOIs). A key factor that increases the risk of complications and severe illness is age: older people are more likely to become seriously ill from COVID-19.
Preventive measures taken worldwide to minimize transmission of SARS-CoV-2 included social distancing, indoor ventilation, covering the face when coughing or sneezing, hand washing, wearing a face mask indoors, and vaccination. To date, at least ten vaccines have been approved by at least one national regulatory authority for use in the general public.
Artificial Intelligence contributes to dealing with this global crisis by making it possible to build predictive models with Machine Learning algorithms. Machine Learning algorithms process knowledge and represent it in mathematical form, and they are applied successfully to a multitude of problems in many scientific fields, such as Data Mining, Probability and Statistics, and Neurobiology.
The aim of this work is to compare different algorithmic models in order to find the best way to predict the mortality of patients with COVID-19, using 6 classification algorithms and data on the patients' clinical characteristics and medical history. More specifically, the data set consists of 12,425,179 people suspected of having COVID-19 who attended health facilities in Mexico, 3,993,464 of whom tested positive for SARS-CoV-2. The 6 algorithms used are Logistic Regression (LR), Decision Trees (DTs), Random Forest (RF), Extreme Gradient Boosting (XGB), Multilayer Perceptron neural networks (MLPs) and K Nearest Neighbors (KNN).
Preprocessing consisted of removing samples with missing values, eliminating features (columns), mainly geographic ones, that were not related to COVID-19 mortality, and transforming the continuous data in 6 different ways (No Scaling, Standard Scaling, and Min-Max Scaling to the ranges 0-1, 0-10, 0-100 and 0-1000). The data was then fed to the different models of each algorithm. For each algorithm, 54 models were generated (6 preprocessing modes x 3 feature sets x 3 hyperparameter sets), and each model was run for 10 iterations on different subsets of the data set to obtain the mean of each metric. This gives 540 executions per algorithm and a total of 3,240 executions for all 6 algorithms.
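The execution counts in the paragraph above follow from a Cartesian product of the experimental settings. The following is a minimal Python sketch of that arithmetic, together with a plain-Python min-max rescaling. The feature-set and hyperparameter-set labels are hypothetical placeholders (the abstract states only that there are 3 of each), and `min_max_scale` is an illustrative stand-in, not the scaler implementation used in the thesis.

```python
from itertools import product

# Scaling modes listed in the abstract; None stands for "No Scaling".
SCALINGS = [None, "standard", (0, 1), (0, 10), (0, 100), (0, 1000)]
# Hypothetical labels: the abstract only says there are 3 of each.
FEATURE_SETS = ["features_a", "features_b", "features_c"]
HYPERPARAM_SETS = ["params_a", "params_b", "params_c"]
ALGORITHMS = ["LR", "DT", "RF", "XGB", "MLP", "KNN"]
ITERATIONS = 10  # runs per model, each on a different subset of the data


def min_max_scale(values, low, high):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:  # constant column: avoid division by zero
        return [float(low) for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# One model per combination of scaling mode, feature set and hyperparameter set.
models_per_algorithm = len(list(product(SCALINGS, FEATURE_SETS, HYPERPARAM_SETS)))
runs_per_algorithm = models_per_algorithm * ITERATIONS
total_runs = runs_per_algorithm * len(ALGORITHMS)

print(models_per_algorithm, runs_per_algorithm, total_runs)  # 54 540 3240
print(min_max_scale([20, 40, 60], 0, 100))  # [0.0, 50.0, 100.0]
```

The four Min-Max ranges differ only in the `low` and `high` arguments, so they apply the same linear transformation at different output scales.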
The models were then evaluated on 5 metrics: Precision, Sensitivity (Recall), F1 Score, Area Under the ROC Curve (AUC_ROC) and Runtime. Ranked by the performance of their models, the algorithms placed as follows: 1st Extreme Gradient Boosting (XGB), 2nd Random Forest (RF), 3rd Multilayer Perceptrons (MLPs), 4th Decision Trees (DTs), 5th K Nearest Neighbors (KNN) and 6th Logistic Regression (LR). The optimal model was the XGB model that used all 22 features (columns), the Min-Max scaler with a range of 0-100 and the optimal_01 set of hyperparameter values, with mean Precision 0.93764 (93.76%), Recall 0.95472 (95.47%), F1-score 0.9113, AUC_ROC 0.97855 and Runtime 6.67306 sec.
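For reference, Precision, Recall and F1 Score can be computed directly from confusion-matrix counts. The sketch below uses made-up counts, not results from the thesis, and the function name `classification_metrics` is illustrative. AUC_ROC is omitted because it requires the model's ranked scores rather than counts alone.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive
    and false-negative counts; each metric is 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 90 correct positive predictions,
# 10 false alarms, 30 missed positives.
precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=30)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9 0.75 0.8182
```

F1 is the harmonic mean of Precision and Recall, so it always lies between the two and sits closer to the lower one.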
Main subject category:
Health Sciences
Keywords:
COVID-19, SARS-CoV-2, RNA viruses, Coronaviruses, Spike protein, ACE2 receptor, W.H.O., Wuhan, Variants of concern (VOCs), Variants of interest (VOIs), RT-PCR, mRNA vaccines, Viral vector vaccines, Artificial intelligence, Machine learning, Computational intelligence (CI), Fuzzy logic, Evolutionary algorithms (EA), Logistic regression (LR), Decision trees (DTs), DecisionTreeClassifier, Random forest (RF), RandomForestClassifier, Extreme gradient boosting (XGB), XGBClassifier, Multi-layer perceptrons (MLPs), MLPClassifier, K-nearest neighbors (KNN), KNeighborsClassifier, Python, Pandas, Sklearn, GridSearchCV, 10-fold cross validation, Feature importance, Algorithm evaluation metrics, Confusion matrix, Precision, Recall, F1 score, Runtime, Area under the ROC curve (AUC_ROC)
Index:
No
Number of index pages:
0
Contains images:
Yes
Number of references:
500
Number of pages:
238
[Katigoriopoiisi.as8enwn.COVID-19.me.ti.xrisi.algori8mwn.ML]~[TELIKO-PERGAMOS].pdf (10 MB)