A comparative evaluation of machine learning algorithms: binary classification on medical data
Master Thesis
Author
Γουμενάκης, Παναγιώτης
Goumenakis, Panagiotis
Date
2019-09
Advisor
Πρέντζα, Ανδριάνα
Keywords
Machine learning algorithms ; Binary classification ; SVM ; Naïve Bayes ; Decision trees ; Logistic regression ; ANN ; Mesothelioma dataset ; UCI
Abstract
Machine Learning (ML) is now widely applied and recognised as an effective tool for a broad range of real-world problems, including medical applications. As the volume of healthcare data grows year by year, ML applications have driven remarkable progress in disease forecasting. From predicting epidemic outbreaks and individual diseases to supporting better ways of labelling and storing healthcare data, ML applied to healthcare has produced accurate results.
This thesis focuses on two major aspects of ML. First, it analyses a medical dataset, providing visualisations together with valuable information on the dataset’s variables. Second, it implements appropriate algorithms for binary classification in order to determine whether a person is labelled as infected or not infected, based on the feature values of the sample set. Choosing the most suitable approach is crucial, as it could improve clinical decisions as well as patients’ survival time when applied to real-world problems.
The research is based on the mesothelioma disease dataset, hosted on the UCI repository, which contains 324 examples with 35 attributes. For the supervised learning part, several ML classification algorithms are used to perform the analysis and draw conclusions: Decision Tree, Support Vector Machines (SVM), Naive Bayes Classifier, Logistic Regression, k Nearest Neighbours (kNN), and Artificial Neural Networks (ANN).
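To give a concrete sense of one of the listed classifiers, the following is a minimal sketch of k Nearest Neighbours for binary classification. It is not the thesis implementation: the function name `knn_predict`, the choice of Euclidean distance, the value `k=3`, and the toy data are all assumptions made for illustration, and the mesothelioma dataset itself is not used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points.

    Hypothetical helper for illustration; uses Euclidean distance.
    """
    # Pair every training point's distance to x with its label, nearest first.
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumed): two numeric features per example, labels 0 and 1.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, (1.1, 1.0))  # lies in the cluster labelled 0
pred_b = knn_predict(train_X, train_y, (3.0, 3.0))  # lies in the cluster labelled 1
print(pred_a, pred_b)
```

On real data with attributes on different scales, features would normally be standardised before computing distances, and k would be chosen by validation rather than fixed in advance.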
For evaluation, several statistical measures are used: accuracy, sensitivity, specificity, F1-score, the confusion matrix, the ROC (Receiver Operating Characteristic) curve, and AUC (Area Under the Curve).
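The evaluation measures named above can be derived from the four confusion-matrix counts, and AUC can be computed as the fraction of positive/negative pairs that the classifier ranks correctly. The sketch below illustrates these standard definitions only. The function names `binary_metrics` and `roc_auc`, the 0.5 decision threshold, and the labels and scores are hypothetical and are not results from the thesis.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived measures; label 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive class
    specificity = ratio(tn, tn + fp)  # recall on the negative class
    precision = ratio(tp, tp + fp)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Sensitivity and specificity are reported separately because accuracy alone can be misleading when one class is much rarer than the other, which is common in medical datasets.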