Σύγκριση αλγορίθμων εξόρυξης γνώσης από πολύ μεγάλες βάσεις δεδομένων
Comparison of data mining algorithms from very large databases
Keywords: Databases -- Management
The century we live in is undoubtedly the century of information. The growth of the Internet and its use in everyday life have created a space to which huge amounts of information are added daily. The research department of IBM estimates that the social networking service Facebook alone adds 100 Terabytes of information each day. It is also estimated that by 2020 the volume of information circulating on social media will exceed 35 Zettabytes. To grasp how large this amount is, it is worth mentioning that 1 Zettabyte equals 10²¹ bytes, or 10¹² Gigabytes. Although this amount of data may seem unreal, it is only a small percentage of the overall data that will move across the Internet, because the idea of the Internet of Things (IoT) has already started to become reality.

Unfortunately, the existence of data does not imply the existence of knowledge: "We are drowning in data, but starving for knowledge" (anonymous). In order to turn the information society into a knowledge society, we need fast and efficient methods of management and analysis that can quickly extract reliable knowledge from huge volumes of data. Many research teams are now working in this direction, trying to contribute to the transformation of large volumes of data into knowledge. One of the promising areas for extracting knowledge from large volumes of data is Data Mining.

In recent years many algorithms have been developed to analyze data. In most cases these algorithms are too complex to be implemented by a "simple" user, which makes data analysis an extremely difficult process for non-specialists. For this reason, many user-friendly software packages have been developed that allow the end user to apply these algorithms to his or her data. In this thesis we present the clustering algorithms included in the popular data analysis software WEKA, in order to study and compare their ability to manage large data files.
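The unit conversion above (1 Zettabyte = 10²¹ bytes = 10¹² Gigabytes) can be checked with a short stdlib-only Java snippet; BigInteger is used because 10²¹ exceeds the range of long:

```java
import java.math.BigInteger;

public class ZettabyteCheck {
    public static void main(String[] args) {
        BigInteger zettabyteInBytes = BigInteger.TEN.pow(21); // 1 ZB in bytes
        BigInteger gigabyteInBytes  = BigInteger.TEN.pow(9);  // 1 GB in bytes
        // 1 ZB / 1 GB = 10^12, i.e. one trillion Gigabytes
        System.out.println(zettabyteInBytes.divide(gigabyteInBytes));
    }
}
```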
We also implemented and integrated the CURE (Clustering Using REpresentatives) algorithm into the WEKA software; CURE is considered one of the most promising data mining algorithms due to its ability to manage large volumes of data and to identify outliers. Through a large number of experiments, we present results that show the data processing limits of each algorithm in WEKA, as well as the corresponding execution times as a function of the number of records and of attributes, respectively.
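As a rough illustration of CURE's core idea (not the thesis's actual WEKA integration), the stdlib-only Java sketch below summarizes a cluster by a few well-scattered representative points, chosen with a farthest-point heuristic and then shrunk toward the centroid by a factor alpha; the shrinking is what damps the influence of outliers. All class and parameter names here are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of CURE's representative-point step (illustrative, not WEKA API).
public class CureRepresentatives {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double[] centroid(List<double[]> pts) {
        double[] c = new double[pts.get(0).length];
        for (double[] p : pts)
            for (int i = 0; i < c.length; i++) c[i] += p[i] / pts.size();
        return c;
    }

    // Pick numRep well-scattered points, then shrink each toward the centroid.
    static List<double[]> representatives(List<double[]> cluster, int numRep, double alpha) {
        double[] c = centroid(cluster);
        List<double[]> reps = new ArrayList<>();
        // First representative: the point farthest from the centroid.
        double[] far = cluster.get(0);
        for (double[] p : cluster) if (dist(p, c) > dist(far, c)) far = p;
        reps.add(far);
        // Next representatives: maximize distance to the nearest chosen one.
        while (reps.size() < numRep && reps.size() < cluster.size()) {
            double[] best = null;
            double bestD = -1;
            for (double[] p : cluster) {
                double d = Double.MAX_VALUE;
                for (double[] r : reps) d = Math.min(d, dist(p, r));
                if (d > bestD) { bestD = d; best = p; }
            }
            reps.add(best);
        }
        // Shrink toward the centroid: this damps the influence of outliers.
        List<double[]> shrunk = new ArrayList<>();
        for (double[] r : reps) {
            double[] s = new double[r.length];
            for (int i = 0; i < r.length; i++) s[i] = r[i] + alpha * (c[i] - r[i]);
            shrunk.add(s);
        }
        return shrunk;
    }
}
```

The full CURE algorithm additionally merges clusters hierarchically by the distance between their representative sets and uses sampling and partitioning to scale to large inputs; this sketch shows only the representative-selection and shrinking step.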