dc.description.abstractEN | The era we live in is dominated by the big data phenomenon. Many enterprises store large amounts of data for analysis, making the field of data analysis more and more important. But how can one gain insights from such a rapidly growing body of data? This is where data mining comes in, the field this thesis investigates, and in particular clustering, whose algorithms group unlabeled data into clusters. Since the clustering has to be performed in a distributed way, the Hadoop framework is analyzed; it offers a way to execute parallel programs on a cluster of machines while storing the data in a distributed file system. Parallel execution is achieved with the MapReduce paradigm, which expresses complex computations as operations over sets of <key, value> pairs so that many jobs can run concurrently. Distributed storage is provided by the Hadoop Distributed File System (HDFS), a file system that offers scalable and reliable data storage. One of the projects that uses Hadoop to apply clustering, classification, and collaborative-filtering techniques to large data sets is Mahout. Among the clustering algorithms, the thesis focuses on a fuzzy clustering algorithm, Fuzzy kMeans, which groups the data into clusters by assigning to each data point a different degree of membership in each cluster. The main problem to be solved is how data stored in an OLAP database can be clustered in a parallel way. To that end, the thesis studies how Mahout implements the fuzzy kMeans algorithm on top of Hadoop's components, and then proposes a similar approach to the algorithm that runs on top of an OLAP database. In more detail, instead of MapReduce, the new fuzzy kMeans implementation uses a three-step <key, value> pair scheme (Map, Reduce & FinalReduce). In the first step, the clustering work is split and assigned to different threads, each of which emits its own <key, value> pairs. In the second step, each thread, without waiting for the other threads to finish their first step, continues with the <key, value> pairs extracted in its own previous step and emits new <key, value> pairs. The last step takes place once all threads have finished the two previous steps; the results of the second step are then merged to produce the final clustering result. In addition, instead of HDFS, the implementation uses an OLAP database for storage. A prototype of this design has been developed, and preliminary tests were run comparing the different fuzzy kMeans clustering implementations. Finally, the thesis concludes that more attention should be given to data that grows day by day, since mining and extracting information from it becomes more and more complicated, while it remains a very challenging and attractive task. | el |
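For context, the degree-of-membership assignment that fuzzy clustering algorithms such as Fuzzy kMeans rely on is commonly written with the standard fuzzy c-means update, where u_{ij} is the membership of point x_i in cluster j, c_j the centroid of cluster j, c the number of clusters, and m > 1 the fuzziness parameter (generic textbook notation, not symbols taken from the thesis):

$$
u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}},
\qquad
c_j = \frac{\sum_{i=1}^{n} u_{ij}^{\,m}\, x_i}{\sum_{i=1}^{n} u_{ij}^{\,m}}
$$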
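The three-step scheme sketched in the abstract (Map, Reduce & FinalReduce over threads instead of Hadoop MapReduce) could look roughly like the following. This is a minimal illustrative sketch in Java, assuming each thread owns one data partition; the class and method names (ThreeStepPipeline, map, reduce, finalReduce) are hypothetical and the per-cluster values are plain distance sums, so this is not the thesis's prototype code, only a demonstration of the threading pattern.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative three-step <key, value> pipeline (Map, Reduce & FinalReduce).
// Names and the per-cluster statistic are made up for the sketch.
public class ThreeStepPipeline {

    // Step 1 (Map): a thread turns its own data partition into <key, value> pairs,
    // here keyed by cluster index with the point-to-centroid distance as the value.
    static Map<Integer, List<Double>> map(List<double[]> partition, List<double[]> centroids) {
        Map<Integer, List<Double>> pairs = new HashMap<>();
        for (double[] point : partition) {
            for (int cluster = 0; cluster < centroids.size(); cluster++) {
                double distance = euclidean(point, centroids.get(cluster));
                pairs.computeIfAbsent(cluster, k -> new ArrayList<>()).add(distance);
            }
        }
        return pairs;
    }

    // Step 2 (Reduce): the same thread immediately aggregates its own pairs,
    // without waiting for the other threads to finish their Map step.
    static Map<Integer, Double> reduce(Map<Integer, List<Double>> pairs) {
        Map<Integer, Double> partial = new HashMap<>();
        pairs.forEach((cluster, values) ->
            partial.put(cluster, values.stream().mapToDouble(Double::doubleValue).sum()));
        return partial;
    }

    // Step 3 (FinalReduce): runs once, after every thread has completed Map and Reduce,
    // and merges all partial results into the final result.
    static Map<Integer, Double> finalReduce(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> merged = new HashMap<>();
        for (Map<Integer, Double> partial : partials)
            partial.forEach((cluster, sum) -> merged.merge(cluster, sum, Double::sum));
        return merged;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) throws Exception {
        List<double[]> centroids = List.of(new double[]{0, 0}, new double[]{5, 5});
        List<List<double[]>> partitions = List.of(
            List.of(new double[]{1, 1}, new double[]{2, 2}),
            List.of(new double[]{4, 4}, new double[]{6, 6}));

        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Map<Integer, Double>>> futures = new ArrayList<>();
        for (List<double[]> partition : partitions) {
            // Each thread chains its own Map and Reduce steps.
            Callable<Map<Integer, Double>> task = () -> reduce(map(partition, centroids));
            futures.add(pool.submit(task));
        }

        List<Map<Integer, Double>> partials = new ArrayList<>();
        for (Future<Map<Integer, Double>> f : futures) partials.add(f.get()); // wait for all threads

        System.out.println(finalReduce(partials)); // FinalReduce merges the partial results
        pool.shutdown();
    }
}
```

The point the sketch illustrates is the one the abstract describes: each thread runs its Map and Reduce steps back to back on its own partition, and only the FinalReduce step waits for all threads before merging their partial results into the final clustering output.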