Parallel K-Means clustering on top of an OLAP database
Subject: Cluster analysis ; Data mining ; Databases ; Electronic data processing
Driven by advances in data collection and storage, increasingly large and high-dimensional datasets are being stored. Without special tools, human analysts can no longer make sense of such massive volumes of data. As a consequence, intelligent Data Mining techniques are being developed to partly or fully automate the analysis. Clustering is an essential step in this process: by grouping similar high-dimensional data points together, it reveals natural structures and identifies interesting patterns in the underlying data. Several clustering algorithms have been proposed for this purpose, among them K-Means, which is well known for its efficiency in clustering large datasets. At the same time, while Data Mining techniques are necessary for managing this continuously growing data, distributed systems have become equally important. Without them, computing environments could not process data at such scale with acceptable performance. Hence, several distributed frameworks have been created that aim to combine the fault tolerance and efficiency of distributed systems with the capabilities of Data Mining techniques. This Thesis investigates Apache Hadoop, a distributed framework for large-scale data processing around which an entire ecosystem of stand-alone projects has been built. Mahout is one of these projects, created to address Data Mining tasks. The Thesis studies how the K-Means algorithm is implemented and used to process huge amounts of data with Mahout's machine learning libraries in combination with the Hadoop Distributed File System and the MapReduce key-value pair paradigm. Additionally, a prototype was developed that uses a new parallel version of K-Means to solve our main problem: performing parallel clustering on top of a distributed OLAP database.
This algorithm uses Mahout's libraries to achieve scalability, an OLAP database for data storage, and a new key-value pair concept for processing the data in parallel. The concept is based on the MapReduce paradigm but works differently: the whole procedure has three phases (Map-Reduce-FinalReduce). First, in the Map phase, numerous threads are created that run different Map jobs in parallel, each producing its own key-value pairs. Then, in the Reduce phase, each thread proceeds directly to its corresponding Reduce job and processes the key-value pairs output by its own Map job, without waiting for the other threads to finish their Map tasks. Finally, in the FinalReduce phase, the output key-value pairs of every thread's Reduce phase are merged and the final key-value pairs are created. The Thesis closes with a summary of preliminary tests and experimental results for these implementations, which verify that the prototype works properly and produces complete and reliable results.
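The three phases described above can be illustrated with a minimal Python sketch of a single K-Means iteration. It is not the Thesis prototype: it uses neither Mahout nor an OLAP database, and it assumes the data has already been split into in-memory chunks, one per thread. All function names (`map_phase`, `reduce_phase`, `final_reduce`, `worker`, `kmeans_iteration`) and the choice of (cluster id, point) as the key-value pair are hypothetical, chosen only to show the control flow in which each thread runs its own Map and Reduce back to back and only FinalReduce sees all threads' output.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(chunk, centroids):
    # Map: emit one (cluster_id, point) key-value pair per input point.
    return [(nearest(pt, centroids), pt) for pt in chunk]


def reduce_phase(pairs):
    # Reduce: aggregate this thread's own pairs into
    # cluster_id -> (coordinate sums, point count).
    partial = {}
    for cid, pt in pairs:
        sums, n = partial.get(cid, ([0.0] * len(pt), 0))
        partial[cid] = ([s + x for s, x in zip(sums, pt)], n + 1)
    return partial


def worker(chunk, centroids):
    # Each thread runs its Reduce right after its own Map,
    # without waiting for the other threads' Map jobs.
    return reduce_phase(map_phase(chunk, centroids))


def final_reduce(partials, centroids):
    # FinalReduce: merge the per-thread partial sums and compute new centroids.
    merged = {}
    for partial in partials:
        for cid, (sums, n) in partial.items():
            msums, mn = merged.get(cid, ([0.0] * len(sums), 0))
            merged[cid] = ([a + b for a, b in zip(msums, sums)], mn + n)
    # Assumption: a cluster that received no points keeps its old centroid.
    return [
        [s / merged[i][1] for s in merged[i][0]] if i in merged else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans_iteration(chunks, centroids):
    # One thread per chunk; FinalReduce starts once all threads have returned.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(lambda ch: worker(ch, centroids), chunks))
    return final_reduce(partials, centroids)


if __name__ == "__main__":
    chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_iteration(chunks, [[0.0, 0.0], [10.0, 10.0]]))
```

Because of CPython's global interpreter lock, this sketch shows the structure of the phases rather than a real speedup. A full clustering run would call `kmeans_iteration` repeatedly until the centroids stop changing.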