Data mining, cleaning, feature extraction, and machine learning approaches for big data in electronic health records : liver cancer risk factor analysis and model explainability
Master Thesis
Author
Kouremenou, Eleftheria Georgia
Κουρεμένου, Ελευθερία Γεωργία
Date
2023
Keywords
Explainability ; PySpark ; Liver cancer ; Apache Spark ; Parallel processing ; Machine learning ; Data cleaning ; Data mapping ; Big data
Abstract
In this thesis, we propose a comprehensive methodology that employs advanced machine learning models and big data processing techniques for predicting liver cancer. We first performed data cleaning and mapping on a very large dataset, using tools such as Apache Sedona Spark and Google Colab to optimize the joining and processing of these large data resources. An essential part of our methodology involved translating the blood values from their original language to English and converting them from character strings to double format. Moreover, we computed the average value of each patient's blood results. Our dataset comprises records of patients with and without cancer. If a patient's record exists in the cancer dataset, we assign y = 1, indicating the presence of cancer; otherwise, we assign y = 0, indicating its absence. Our predictive models take into account various external factors that may contribute to the disease, drawn from translated ICD-9 and ICD-10 codes, such as complications from drug use, surgery, and organ removal, as well as demographic factors such as age and sex and health conditions such as cirrhosis and hepatitis B. These factors were assessed using unsupervised and supervised machine learning models, including LightGBM, XGBoost, Support Vector Machine, and Gradient Boosting. The models' outputs were evaluated and compared; the most important features were found to include age, marital status (MER), sex, and the above-mentioned health conditions. Finally, we include an explainability implementation.
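The conversion, averaging, and labelling steps described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the thesis code, which uses PySpark. The function names, the toy records, and the handling of a comma as decimal separator are all hypothetical assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def parse_value(raw):
    """Convert a blood value stored as text to float; return None if it is not numeric.

    Treating a comma as a decimal separator is an assumption, not a detail from the thesis.
    """
    try:
        return float(raw.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None


def build_rows(blood_records, cancer_patient_ids):
    """Average each patient's numeric blood values and attach the label y.

    y = 1 if the patient id appears in the cancer dataset, otherwise y = 0.
    """
    values_by_patient = defaultdict(list)
    for patient_id, raw in blood_records:
        value = parse_value(raw)
        if value is not None:
            values_by_patient[patient_id].append(value)
    return {
        patient_id: {
            "mean_value": mean(values),
            "y": 1 if patient_id in cancer_patient_ids else 0,
        }
        for patient_id, values in values_by_patient.items()
    }


# Hypothetical toy records: (patient id, blood value stored as text).
records = [("p1", "4,0"), ("p1", "6.0"), ("p2", "3.5"), ("p2", "n/a")]
rows = build_rows(records, cancer_patient_ids={"p1"})
```

On a Spark DataFrame, the same idea would presumably be expressed as a cast of the value column to `double`, followed by a `groupBy` on the patient identifier with an average aggregation.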