Parallel k-means clustering on top of an OLAP database

Μαυρογιώργου, Αργυρώ

dc.contributor.advisor	Πρέντζα, Ανδριάνα
dc.contributor.advisor	Peris, Ricardo Jimenez
dc.contributor.author	Μαυρογιώργου, Αργυρώ
dc.date.accessioned	2015-09-21T14:22:20Z
dc.date.available	2015-09-21T14:22:20Z
dc.date.issued	2015-02-13
dc.identifier.uri	https://dione.lib.unipi.gr/xmlui/handle/unipi/7711
dc.format.extent	119	el
dc.language.iso	en	el
dc.publisher	University of Piraeus	el
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Cluster analysis	el
dc.subject	Data mining	el
dc.subject	Databases	el
dc.subject	Electronic data processing	el
dc.title	Parallel k-means clustering on top of an OLAP database	el
dc.type	Master Thesis	el
dc.contributor.department	School of Information and Communication Technologies. Department of Digital Systems	el
dc.identifier.call	001.644'2 ΜΑΥ	el
dc.description.abstractEN	Driven by advances in data collection and storage, increasingly large and high-dimensional datasets are being stored. Without special tools, human analysts can no longer make sense of such massive volumes of data. As a consequence, intelligent Data Mining techniques are being developed to semi-automate or fully automate the mining process. Clustering is an essential step in this process: by grouping high-dimensional data together, it reveals natural structures and identifies interesting patterns in the underlying data. Several clustering algorithms have been proposed to process this data, among them K-Means, which is well known for its efficiency in clustering large datasets. At the same time, while Data Mining techniques are necessary for managing this continuously growing data, the use of distributed systems has become a major issue. Without them, computing environments would not be able to manage data at such scale with high performance. Hence, several distributed frameworks have been created that aim to combine the fault tolerance and efficiency of distributed systems with the capabilities of Data Mining techniques. This thesis investigates Apache Hadoop, a distributed framework for large-scale data processing around which an entire ecosystem of stand-alone projects has been built. Mahout is one of these projects, created to address Data Mining problems. The thesis studies how the K-Means algorithm is used and implemented to process huge amounts of data using Mahout's machine learning libraries in combination with the Hadoop Distributed File System and the MapReduce key-value pair paradigm.
Additionally, a prototype project was developed in this thesis using a new parallel version of K-Means, in order to solve the main problem of performing parallel clustering on top of a distributed OLAP database. The algorithm uses Mahout's libraries to achieve scalability, an OLAP database for data storage, and a new key-value pair concept for processing the data in parallel. This concept is based on the MapReduce paradigm but works differently: the whole procedure has three phases (Map-Reduce-FinalReduce). First, during the Map phase, numerous threads are created that run different Map jobs in parallel, each producing its own key-value pairs. Then, in the Reduce phase, each thread continues with its corresponding Reduce job, processing the key-value pairs output by its own Map job without waiting for the other threads to finish their Map tasks. Finally, in the FinalReduce phase, the output key-value pairs from every thread's Reduce phase are merged and the final key-value pairs are created. The thesis closes with summarized preliminary tests and experimental results for these implementations, which verify that the developed prototype works properly and produces complete and reliable results.	en
dc.corporate.name	Universidad Politecnica de Madrid. Escuela Tecnica Superior de Ingenieros Informaticos	el
dc.contributor.master	Digital Systems and Services	el
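
The three-phase Map-Reduce-FinalReduce concept described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses neither Mahout nor an OLAP database, in-memory partitions stand in for the stored data, and every function name (`map_phase`, `reduce_phase`, `final_reduce`, `kmeans`) is hypothetical. It only assumes that each thread assigns its points to the nearest centroid (Map), folds its own pairs into partial sums without waiting for other threads (Reduce), and that the partial sums are then merged into new centroids (FinalReduce).

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(partition, centroids):
    """Map: emit one (cluster_id, point) pair per input point."""
    return [(nearest(point, centroids), point) for point in partition]


def reduce_phase(pairs):
    """Reduce: fold one worker's pairs into cluster_id -> (vector_sum, count)."""
    partial = {}
    for cluster_id, point in pairs:
        total, count = partial.get(cluster_id, ([0.0] * len(point), 0))
        partial[cluster_id] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def map_then_reduce(partition, centroids):
    """Work of one thread: its Reduce starts as soon as its own Map ends."""
    return reduce_phase(map_phase(partition, centroids))


def final_reduce(partials, centroids):
    """FinalReduce: merge per-thread partial sums and compute new centroids."""
    merged = {}
    for partial in partials:
        for cluster_id, (total, count) in partial.items():
            m_total, m_count = merged.get(cluster_id, ([0.0] * len(total), 0))
            merged[cluster_id] = (
                [a + b for a, b in zip(m_total, total)],
                m_count + count,
            )
    # Assumption: a cluster that received no points keeps its old centroid.
    return [
        [v / merged[i][1] for v in merged[i][0]] if i in merged else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of Map-Reduce-FinalReduce rounds, one thread per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(iterations):
            partials = list(
                pool.map(lambda part: map_then_reduce(part, centroids), partitions)
            )
            centroids = final_reduce(partials, centroids)
    return centroids
```

Because each thread emits only per-cluster sums and counts, the FinalReduce step merges a small dictionary per thread rather than every individual key-value pair, which is one plausible reason for separating it from the per-thread Reduce.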

Files in this item

Name: Mavrogiorgou_Argyro.pdf
Size: 22.62 MB
Format: PDF
Description: Master's thesis

This item appears in the following collections

Department of Digital Systems

Except where otherwise noted, this item is distributed under the following license:
Attribution-NonCommercial-NoDerivatives 4.0 International