Ομαδοποίηση μεγάλης κλίμακας δεδομένων στην Πλατφόρμα Spark
Large-scale data clustering in Spark

Keywords
Ομαδοποίηση ; Big data ; Clustering ; Spark ; MLlib ; Python ; K-means ; Bisecting K-means ; Gaussian Mixture Model ; Power Iteration Clustering ; Machine learning
Abstract
The modern age is often characterized as the "Big Data" era because of the growth in the volume of data produced daily. These
data are now a basic source for knowledge mining. Recent estimates indicate that the volume of data produced
every two days equals the volume created from the beginning of mankind until 2003.
Traditional data analysis tools are not sufficient for analyzing data at this scale. New tools for
analyzing large-scale data are thus constantly being developed.
Based on these needs, the present dissertation deals with large-scale data clustering on the Spark platform. In
the first chapter the reader is introduced to the concept of large-scale data. More specifically, we present its
development and challenges, as well as the ways in which such data can be produced and acquired.
The second chapter introduces the concept of clustering. We present the distance and similarity measures for
each type of data, as well as the measures used for clusters. The categories of clustering algorithms are then
listed, along with the categories used for clustering large-scale data.
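As a minimal illustration of the kind of distance and similarity measures mentioned above, the following Python sketch computes the Euclidean distance and the cosine similarity of two numeric vectors. The function names are illustrative assumptions and are not taken from the dissertation, which may define further measures for other data types.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

A distance grows as points move apart, while a similarity grows as they become more alike, so the two are not interchangeable inside a clustering algorithm without a conversion.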
The third chapter presents the Spark platform, which is widely used for large-scale data analysis. More
specifically, the components of the platform as well as its various libraries are presented, including MLlib and
PySpark, which are used for the analysis in this dissertation.
The last chapter describes and compares the results of the clustering algorithms found in the MLlib library
through various evaluation measures calculated for each case.
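
The abstract does not name the evaluation measures used. As one hedged example, the within-cluster sum of squared errors is a measure commonly reported for K-means style algorithms; the plain-Python sketch below (the function name and the sample data are assumptions for illustration only) shows how it can be computed from a set of points and cluster centers.

```python
def within_cluster_sse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for point in points:
        # Squared distance to the closest center; lower totals mean tighter clusters.
        total += min(
            sum((x - c) ** 2 for x, c in zip(point, center))
            for center in centers
        )
    return total


# Hypothetical toy data: two well-separated groups of two points each.
sample_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sample_centers = [(0.0, 1.0), (10.0, 1.0)]
sample_sse = within_cluster_sse(sample_points, sample_centers)
```

Lower values indicate more compact clusters, but the measure always decreases as the number of clusters grows, so it is usually compared across algorithms at a fixed number of clusters.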