Cancer type prediction with machine learning algorithms, a bio-centric approach

Postgraduate Thesis uoadl:3401625

Unit:
Specialization in Bioinformatics-Computational Biology
Library of the School of Science
Deposit date:
2024-06-21
Year:
2024
Author:
Sleiman Ilias
Supervisors info:
Vasiliki Oikonomidou, Associate Professor, Department of Biology, NKUA (Supervisor)
Aristotelis Chatziioannou, Researcher A', Center of Systems Biology, Biomedical Research Foundation of the Academy of Athens
Ioannis Trougkakos, Professor, Department of Biology, NKUA
Original Title:
Πρόβλεψη τύπου καρκίνου με αλγόριθμους μηχανικής μάθησης, μια βιολογικοκεντρική προσέγγιση
Languages:
Greek
Translated title:
Cancer type prediction with machine learning algorithms, a bio-centric approach
Summary:
Triple-negative breast cancer (TNBC) is a subtype of breast cancer that lacks expression of the estrogen receptor (ER) and the progesterone receptor (PR) and does not overexpress human epidermal growth factor receptor 2 (HER2). Developing effective diagnostic and therapeutic methods for TNBC remains one of the greatest challenges in oncology. Machine learning, which develops algorithms and statistical models that let computers learn from data and make predictions or decisions without being explicitly programmed for each task, is emerging as a powerful tool for addressing this challenge. Its integration into biology has produced strong results, for example in genomic analysis for disease research, enabling earlier and more accurate prediction and diagnosis of many diseases and improved clinical decision making.
This thesis integrates gene expression microarray data with biological ontology terms to train machine learning algorithms that classify patients by breast cancer subtype. The two classes are TNBC and Non-TNBC. The algorithms were trained on transcriptomic data for genes identified as important in TNBC (the TNBC gene signature) and for the IRE1 signature (IRE1sign38). IRE1 has been shown to be a powerful biological marker for many rare and aggressive cancers. High IRE1 activity has been associated, via the IRE1-XBP1 pathway, with increased cancer aggressiveness and a lower overall survival probability. To derive the TNBC gene signature, we first performed a differential expression analysis and then a functional enrichment analysis, identifying the most important genes from the processes and molecular pathways in which they participate and which describe the phenotypic characteristics of TNBC. The results for IRE1 activity and its XBP1 and RIDD components were converted into patient stratification labels according to the activity in each sample and used as additional training features for the machine learning algorithms. In the final stage of the study, we trained several classification algorithms. Among these, Random Forest and Generalized Linear Models (Lasso and Elastic-Net Regularized) showed the most promising performance. Retraining with only the most important features then significantly improved prediction performance, yielding the two best-performing classification models: 'RF_Top_50' and 'GLM.4'. Our findings highlight the significance of the hub genes derived from the functional analysis and the importance of IRE1 signaling activity in TNBC, which makes IRE1 a powerful biological marker for the prognosis of this cancer type.
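The conversion of activity results into stratification labels could look roughly like the sketch below. It is an illustration only: the abstract does not state the thresholding rule, so the median split, the function name `stratify_by_activity`, and the sample scores are all assumptions rather than the thesis's actual method or data.

```python
from statistics import median


def stratify_by_activity(scores, threshold=None):
    """Map each sample's activity score to a 'high' or 'low' label.

    Hypothetical rule: scores strictly above the threshold are 'high'.
    When no threshold is given, the cohort median is used (an assumption;
    the thesis may use a different cut-off).
    """
    if threshold is None:
        threshold = median(scores.values())
    return {
        sample: ("high" if score > threshold else "low")
        for sample, score in scores.items()
    }


# Made-up activity scores for four hypothetical samples.
ire1_activity = {"S1": 0.82, "S2": 0.15, "S3": 0.47, "S4": 0.91}
labels = stratify_by_activity(ire1_activity)
```

Labels produced this way can be appended to each sample's expression values as an extra categorical feature before training a classifier.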
With refinement of some analysis techniques and the incorporation of additional omics data and ontology terms for processes and molecular pathways, the models we developed could give the medical community a powerful tool for prognosis and for developing therapeutic strategies for breast cancers such as TNBC.
Main subject category:
Science
Keywords:
Triple-negative breast cancer, Breast cancer, Machine Learning, Prediction, bio-centric approach, microarray data, ontologies, TNBC, IRE1 signature, rare cancers, functional enrichment analysis, classification
Index:
No
Number of index pages:
0
Contains images:
Yes
Number of references:
158
Number of pages:
147
File:
File access is restricted until 2026-06-26.

Ilias_Sleiman_THESIS_Final_2024.pdf
5 MB