Algorithms for data homogenization and data quality assurance

Keywords
Data homogenization ; Heterogeneous data sources ; Data cleaning ; Data quality ; Preprocessing techniques ; Data processing algorithms
Abstract
Over the last decade, the volume of data being generated has increased drastically, and this trend is expected to continue in the coming years. In the healthcare domain, this growth is particularly significant, as data production is evolving rapidly and holds great potential for advancing clinical practice and research. The ultimate purpose of collecting such large volumes of data is to enable accurate and targeted predictions that can improve decision-making. Achieving this, however, requires not only a substantial quantity of data but also data of high quality that can be relied upon. In practice, data often comes in various forms and formats, with similar information stored under different variable names or structures. This creates significant heterogeneity challenges and makes data homogenization and integration essential. Data homogenization aims to combine data from multiple autonomous and heterogeneous sources into a unified dataset by resolving issues such as duplicate records and by standardizing the data, so that uniform access and a consolidated view can be achieved. At the same time, raw data often contains numerous erroneous values that compromise its quality. For this reason, preprocessing with appropriate data cleaning techniques is an essential step, comprising a series of actions that ensure accuracy, completeness, and reliability. To this end, this thesis first provides a literature review of existing methods for integrating heterogeneous data sources, followed by techniques and algorithms for data cleaning that ensure data quality. Furthermore, a practical environment is proposed that allows users to apply these algorithms to healthcare data, chosen for its particular importance, in order to homogenize and clean the data and ensure its reliability. Finally, users are able to evaluate the performance of each method and obtain additional insights into the methods' comparative effectiveness.
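As an illustration only, the homogenization and cleaning steps described in the abstract can be sketched in a few lines of Python. This is not the environment proposed in the thesis: the field names, the mapping tables, the plausibility range, and the functions `homogenize`, `clean`, and `integrate` are all hypothetical and chosen purely for the example.

```python
# Hypothetical mapping: each source's variable name -> unified schema name.
COLUMN_MAP = {
    "pid": "patient_id", "PatientID": "patient_id",
    "sex": "gender", "Gender": "gender",
    "sbp": "systolic_bp", "SystolicBP": "systolic_bp",
}

# Hypothetical value standardization for the gender field.
GENDER_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}


def homogenize(record):
    """Rename source-specific fields to the unified schema and standardize values."""
    unified = {COLUMN_MAP.get(key, key): value for key, value in record.items()}
    if unified.get("gender") is not None:
        raw = str(unified["gender"]).strip().lower()
        unified["gender"] = GENDER_MAP.get(raw)  # unknown codes become None
    return unified


def clean(record):
    """Mark implausible values as missing instead of silently keeping them."""
    cleaned = dict(record)
    sbp = cleaned.get("systolic_bp")
    # Assumed plausibility range, for illustration only; not a clinical reference.
    if sbp is not None and not (50 <= sbp <= 300):
        cleaned["systolic_bp"] = None
    return cleaned


def integrate(*sources):
    """Merge sources into one list, keeping the first record seen per patient_id."""
    seen = {}
    for source in sources:
        for record in source:
            row = clean(homogenize(record))
            seen.setdefault(row["patient_id"], row)
    return list(seen.values())


# Two made-up sources that store the same information under different names.
hospital_a = [{"pid": 1, "sex": "M", "sbp": 120}, {"pid": 2, "sex": "F", "sbp": 999}]
hospital_b = [{"PatientID": 2, "Gender": "female", "SystolicBP": 130},
              {"PatientID": 3, "Gender": "male", "SystolicBP": 110}]

unified = integrate(hospital_a, hospital_b)
```

Keeping the first record per identifier is the simplest possible duplicate rule. A realistic pipeline would more likely merge duplicates field by field or apply a record-linkage method, which this sketch does not attempt.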


