Data mining, cleaning, feature extraction, and machine learning approaches for big data in electronic health records : liver cancer risk factor analysis and model explainability

Kouremenou, Eleftheria Georgia; Κουρεμένου, Ελευθερία Γεωργία

Master Thesis

Author

Kouremenou, Eleftheria Georgia

Κουρεμένου, Ελευθερία Γεωργία

Date

2023

Abstract

In this thesis, we propose a comprehensive methodology that employs advanced machine learning models and big data processing techniques for predicting liver cancer. We first performed data cleaning and mapping on a very large dataset, using tools such as Apache Sedona Spark and Google Colab to optimize the joining and processing of these large data resources. An essential part of our methodology involved translating blood test values from their original language into English and converting them from character strings to double format. Moreover, we computed the average value of each patient's blood results. Our dataset comprises records of patients with and without cancer. If a patient's record exists in the cancer dataset, we assign y = 1, indicating the presence of cancer; otherwise we assign y = 0, indicating its absence. Our predictive models take into account various factors that may contribute to the disease, obtained in part by translating ICD-9 and ICD-10 codes: external factors such as complications from drug use, surgery, and organ removal; demographic factors such as age and sex; and health conditions such as cirrhosis and hepatitis B. These factors were assessed using both unsupervised and supervised learning approaches, including LightGBM, XGBoost, Support Vector Machine, and Gradient Boosting models. The models' outputs were evaluated and compared, and the most important features were found to include age, marital status (MER), sex, and the above-mentioned health conditions. Finally, we include an explainability implementation for the models.
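The preprocessing steps described in the abstract (converting blood values from characters to doubles, averaging them per patient, and assigning the label y) can be illustrated with a minimal sketch. It is not the thesis's actual pipeline, which the abstract says used Apache Sedona Spark; it uses only the Python standard library. The function names `to_double` and `build_rows`, the record layout, the comma-decimal handling, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def to_double(raw):
    """Convert a blood value stored as text to a float.

    Assumption: values may use a comma as the decimal separator.
    Returns None when the text cannot be parsed as a number.
    """
    try:
        return float(raw.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None


def build_rows(blood_records, cancer_ids):
    """Average each patient's parseable blood values and attach the label y.

    y = 1 if the patient id appears in the cancer dataset, otherwise y = 0.
    Patients with no parseable value are omitted from the result.
    """
    values = defaultdict(list)
    for patient_id, raw in blood_records:
        value = to_double(raw)
        if value is not None:
            values[patient_id].append(value)
    return {
        pid: {"mean_blood_value": mean(vals), "y": 1 if pid in cancer_ids else 0}
        for pid, vals in values.items()
    }


# Hypothetical sample data: (patient id, blood value as text).
blood_records = [("p1", "4,5"), ("p1", "5.5"), ("p2", "7.0"), ("p2", "n/a")]
cancer_ids = {"p1"}  # hypothetical ids present in the cancer dataset

rows = build_rows(blood_records, cancer_ids)
```

In this sketch, unparseable entries are skipped rather than imputed; whether the thesis handled missing or malformed values this way is not stated in the abstract.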

Postgraduate Studies Programme

Information Systems and Services

Department

School of Information and Communication Technologies. Department of Digital Systems

Number of pages

Language

English

URI

https://dione.lib.unipi.gr/xmlui/handle/unipi/15692
http://dx.doi.org/10.26267/unipi_dione/3114

Collections

Department of Digital Systems

Except where otherwise noted, this item's license is described as
Attribution-NonCommercial 3.0 Greece