Η εξάπλωση των κυριότερων ιών ηπατίτιδας σε παγκόσμια κλίμακα : ανάλυση δεδομένων και συμπεράσματα
The spread of the main hepatitis viruses on a global scale : data analysis and conclusions

Keywords
Ηπατίτιδα ; Ανάλυση δεδομένων ; Μηχανική μάθηση ; Machine learning ; Cluster analysis ; Hepatitis ; Data analysis ; SVM ; Ridge regression ; PCA
Abstract
Hepatitis is an inflammation of the liver that can be triggered by various viruses, alcohol consumption, medications, and other factors. The most well-known viral forms are hepatitis A, B, C, D, and E. Advances in medical science have contributed significantly to reducing the number of cases, and most countries have developed specific strategies to limit the spread of the disease. However, countries differ in philosophy, economy, and level of development, so the hepatitis viruses have not spread uniformly worldwide, and many countries still report a high number of cases.
The greatest global concern is caused by hepatitis C, followed by hepatitis B. For this thesis, data on various characteristics of hepatitis B and C in the populations of several countries were collected from the World Health Organization's website. Based on these characteristics, cluster analysis is applied to identify homogeneous groups of countries with similar approaches to managing the hepatitis virus.
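As a rough illustration of the clustering step, the sketch below groups six hypothetical countries with k-means. The abstract does not say which clustering algorithm the thesis uses, so k-means, the two features, and every numeric value here are assumptions chosen only to show the idea.

```python
import math
import random

# Hypothetical toy values, NOT WHO data: (vaccination coverage %, deaths per 100k).
countries = ["A", "B", "C", "D", "E", "F"]
features = [
    [95.0, 1.0], [92.0, 1.5], [90.0, 2.0],
    [60.0, 9.0], [55.0, 10.0], [50.0, 12.0],
]

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

labels = kmeans(standardize(features), k=2)
for name, label in zip(countries, labels):
    print(name, label)
```

Standardizing first keeps a feature measured on a large scale from dominating the distance computation.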
Next, Principal Component Analysis (PCA) is applied to simplify the collected data, which contain a large number of features related to hepatitis B and C. Using the reduced data obtained from PCA, the countries are re-clustered in order to compare the results and evaluate whether the groups formed from the simplified data, which contain less “noise,” are more cohesive.
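The dimensionality reduction can be sketched in a similar spirit. The example below extracts only the first principal component by power iteration on the covariance matrix; the three features and their values are hypothetical, and the thesis may well retain more components or use a library routine instead.

```python
import math

# Hypothetical toy values, NOT WHO data: three correlated indicators per country.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 1.1],
    [3.0, 5.9, 1.4],
    [4.0, 8.2, 2.1],
    [5.0, 9.8, 2.4],
]

def center(data):
    """Subtract the column mean from every value."""
    means = [sum(c) / len(c) for c in zip(*data)]
    return [[v - m for v, m in zip(r, means)] for r in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

centered = center(rows)
cov = covariance(centered)
component, eigenvalue = top_component(cov)

# Share of the total variance captured by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# One score per country: the projection onto the first component.
scores = [sum(a * b for a, b in zip(r, component)) for r in centered]
print(round(explained, 3), [round(s, 2) for s in scores])
```

In a workflow like the one described, scores of this kind would replace the raw features as input to the second round of clustering.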
Furthermore, statistics is widely recognized as useful not only for data analysis but also for prediction. For this reason, two predictive models are applied at the end, which can help improve the strategies implemented by each country. First, a Ridge Regression model is fitted to the data to predict the number of deaths caused by hepatitis B in a given country. Finally, a Support Vector Machine (SVM) is applied to determine whether a country belongs to the category of countries that manage hepatitis B effectively or to the category where significant improvements are required.
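To make the two predictive models concrete, here is a minimal sketch of both on toy numbers. It assumes a closed-form ridge solution with an unpenalized intercept and a linear soft-margin SVM trained by subgradient descent; the abstract does not specify the kernel, the solver, or the hyperparameters, so `lam`, `lr`, `epochs`, and all data values are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    n, d = len(X), len(X[0])
    x_mean = [sum(c) / n for c in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(r, x_mean)] for r in X]
    yc = [v - y_mean for v in y]
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(gram, rhs)
    return w, y_mean - sum(wi * mi for wi, mi in zip(w, x_mean))

def svm_fit(X, y, lam=0.01, lr=0.1, epochs=500):
    """Linear soft-margin SVM via subgradient descent on the hinge loss (labels +1/-1)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            # Only points inside the margin contribute to the hinge subgradient.
            if yi * (sum(wi * v for wi, v in zip(w, xi)) + b) < 1:
                gw = [g - yi * v / n for g, v in zip(gw, xi)]
                gb -= yi / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b

# Hypothetical regression data: y = 2*x1 + 3*x2 + 1 (stand-in for a death count).
X_reg = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y_reg = [2 * a + 3 * b + 1 for a, b in X_reg]
w_ols, b_ols = ridge_fit(X_reg, y_reg, lam=0.0)      # no penalty
w_ridge, b_ridge = ridge_fit(X_reg, y_reg, lam=5.0)  # coefficients shrink

# Hypothetical classification data: +1 = effective management, -1 = needs improvement.
X_cls = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0]]
y_cls = [1, 1, 1, -1, -1, -1]
w_svm, b_svm = svm_fit(X_cls, y_cls)
predictions = [1 if sum(wi * v for wi, v in zip(w_svm, xi)) + b_svm >= 0 else -1
               for xi in X_cls]
print(w_ols, w_ridge, predictions)
```

The ridge penalty trades a small bias for more stable coefficients when features are strongly correlated, which is a common situation with many related health indicators.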


