Ομαδοποίηση μεγάλης κλίμακας δεδομένων στην Πλατφόρμα Spark
Large-scale data clustering in Spark
Keywords: Clustering; Big data; Spark; MLlib; Python; K-means; Bisecting K-means; Gaussian Mixture Model; Power Iteration Clustering; Machine learning
The present age is often characterized as the "Big Data" era because of the rapid growth in the volume of data produced every day. These data are now a primary source for knowledge mining. Recent estimates indicate that the volume of data produced every two days equals the amount of data created from the beginning of mankind until 2003. Traditional data analysis tools are not sufficient for data at this scale, so new tools for large-scale data analysis are constantly being developed. Motivated by these needs, the present dissertation deals with large-scale data clustering on the Spark platform. The first chapter introduces the reader to the concept of large-scale data: its development and challenges, as well as the ways in which such data can be produced and acquired. The second chapter introduces the concept of clustering. We present the distance and similarity measures for each type of data, as well as the measures used for clusters. The categories of clustering algorithms are then listed, as well as the categories used for clustering large-scale data. The third chapter presents the Spark platform, which is widely used for large-scale data analysis. More specifically, it describes the components of the platform and its various libraries, including MLlib and PySpark, which are used for the analysis in this dissertation. The last chapter describes and compares the results of the clustering algorithms available in the MLlib library, using various evaluation measures calculated for each case.