Parallel Fuzzy K-Means clustering on top of an OLAP database
Subject: Data mining ; Database management ; Electronic data processing -- Distributed processing ; Εξόρυξη δεδομένων ; Βάσεις δεδομένων -- Διαχείριση ; Ηλεκτρονική επεξεργασία δεδομένων
We live in the era of big data. Enterprises store large amounts of data for analysis, which makes the field of data analysis more and more important. But how can one gain insight from such massive, constantly evolving data? This is the domain of data mining, which this thesis investigates, and in particular of clustering, whose algorithms group previously unlabeled data into clusters. Because clustering has to be performed in a distributed way, the thesis analyzes the Hadoop framework, which executes parallel programs on a cluster of machines and stores the data in a distributed file system. Parallel execution is achieved with the MapReduce paradigm, which expresses complex computations as transformations over sets of <key, value> pairs so that many tasks can run concurrently. Distributed storage is provided by the Hadoop Distributed File System (HDFS), a file system that offers scalable and reliable data storage. One project that uses Hadoop to apply clustering, classification and collaborative-filtering techniques to large data sets is Mahout. Among the clustering algorithms, the thesis details a fuzzy clustering algorithm, Fuzzy K-Means, which groups data into clusters by assigning each data point a degree of association with every cluster. The main problem addressed is how to cluster, in parallel, data stored in an OLAP database. To that end, the thesis studies how Mahout implements Fuzzy K-Means with Hadoop's components and then proposes a similar approach to the algorithm that runs on top of an OLAP database. In more detail, instead of MapReduce the new Fuzzy K-Means implementation uses a three-step <key, value> pair scheme (Map, Reduce and FinalReduce). In the first step, the clustering work is split and assigned to different threads, each of which emits its own <key, value> pairs. 
In the second step, each thread, without waiting for the other threads to finish their first step, processes the <key, value> pairs it produced in the first step and emits its own <key, value> pairs. The last step takes place once all threads have finished the two preceding steps: the results of the second step are merged to produce the final clustering result. In addition, instead of HDFS, the implementation uses an OLAP database for storage. A prototype of this approach was developed and preliminary tests were run, comparing the different Fuzzy K-Means implementations. Finally, the thesis concludes that more attention should be given to data that evolve day by day: mining and extracting information from such data is becoming increasingly complicated, yet it is also a challenging and attractive task.
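To make the three-step scheme described above more concrete, the following is a minimal Python sketch of one Fuzzy K-Means iteration, written under several assumptions rather than taken from the thesis prototype. It uses the standard Fuzzy K-Means membership and centroid update formulas with a fuzziness exponent of 2. All function names (`memberships`, `map_step`, `reduce_step`, `final_reduce`, `fuzzy_kmeans_iteration`) are hypothetical, and an in-memory list of points stands in for the OLAP database. The choice of cluster index as key and partial weighted sums as value is also an assumption about what the <key, value> pairs could carry.

```python
import math
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

M = 2.0  # fuzziness exponent (assumed value; must be greater than 1)


def memberships(point, centroids, m=M):
    """Degrees of association of one point with every cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exp for d_k in dists) for d_j in dists]


def map_step(chunk, centroids, m=M):
    """Map: emit <cluster index, (weight, weighted point)> pairs for one chunk."""
    pairs = []
    for point in chunk:
        for j, u in enumerate(memberships(point, centroids, m)):
            w = u ** m
            pairs.append((j, (w, [w * x for x in point])))
    return pairs


def reduce_step(pairs, dim):
    """Reduce: fold one thread's own pairs into per-cluster partial sums."""
    partial = defaultdict(lambda: [0.0, [0.0] * dim])
    for j, (w, weighted_point) in pairs:
        acc = partial[j]
        acc[0] += w
        acc[1] = [a + b for a, b in zip(acc[1], weighted_point)]
    return dict(partial)


def final_reduce(partials, centroids):
    """FinalReduce: merge every thread's partial sums into new centroids."""
    dim = len(centroids[0])
    totals = [[0.0, [0.0] * dim] for _ in centroids]
    for partial in partials:
        for j, (w, weighted_sum) in partial.items():
            totals[j][0] += w
            totals[j][1] = [a + b for a, b in zip(totals[j][1], weighted_sum)]
    # Keep the old centroid if a cluster received no weight at all.
    return [
        [s / w for s in weighted_sum] if w > 0.0 else list(old)
        for (w, weighted_sum), old in zip(totals, centroids)
    ]


def fuzzy_kmeans_iteration(points, centroids, n_threads=2):
    """One iteration: each thread runs Map then Reduce, then results are merged."""
    dim = len(centroids[0])
    chunks = [points[i::n_threads] for i in range(n_threads)]

    def worker(chunk):
        # A thread moves from Map to Reduce without waiting for the others.
        return reduce_step(map_step(chunk, centroids), dim)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(worker, chunks))
    # FinalReduce runs only after all threads have finished both steps.
    return final_reduce(partials, centroids)
```

Calling `fuzzy_kmeans_iteration(points, centroids)` repeatedly, feeding each result back in as the next `centroids`, would approximate the full algorithm. In this sketch the per-thread Reduce shrinks each thread's output to one partial sum per cluster, so the final merge only has to combine a handful of values per thread.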