Show simple item record

dc.contributor.advisor: Πρέντζα, Ανδριάνα
dc.contributor.advisor: Peris, Ricardo Jimenez
dc.contributor.author: Μαυρογιώργου, Αργυρώ
dc.date.accessioned: 2015-09-21T14:22:20Z
dc.date.available: 2015-09-21T14:22:20Z
dc.date.issued: 2015-02-13
dc.identifier.uri: http://dione.lib.unipi.gr/xmlui/handle/unipi/7711
dc.format.extent: 119 [el]
dc.language.iso: en [el]
dc.publisher: University of Piraeus [el]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [*]
dc.subject: Cluster analysis [el]
dc.subject: Data mining [el]
dc.subject: Εξόρυξη δεδομένων (Data mining) [el]
dc.subject: Databases [el]
dc.subject: Βάσεις δεδομένων (Databases) [el]
dc.subject: Ηλεκτρονική επεξεργασία δεδομένων (Electronic data processing) [el]
dc.subject: Electronic data processing [el]
dc.title: Parallel k-means clustering on top of an OLAP database [el]
dc.type: Master Thesis [el]
dc.contributor.department: School of Information and Communication Technologies. Department of Digital Systems [el]
dc.identifier.call: 001.644'2 ΜΑΥ [el]
dc.description.abstractEN: Driven by advances in data collection and storage, increasingly large and high-dimensional datasets are being stored. Without special tools, human analysts can no longer make sense of such massive volumes of data. As a consequence, intelligent Data Mining techniques are being developed to partially or fully automate the data mining process. Clustering is an essential step in this process, as it reveals natural structures and identifies interesting patterns in the underlying data by grouping similar high-dimensional data together. Several clustering algorithms have been proposed to process this data, such as the K-Means algorithm, which is well known for its efficiency in clustering large datasets. At the same time, while Data Mining techniques are necessary for managing this continuously growing data, the use of distributed systems has become a major issue. Without them, computing environments would not be able to manage such large volumes of data with high performance. Hence, several distributed frameworks have been created, aiming to combine the fault tolerance and efficiency of distributed systems with the capabilities of Data Mining techniques. This thesis investigates Apache Hadoop, a distributed framework for large-scale data processing around which an entire ecosystem of standalone projects has been built. Mahout is one of these projects, created to address Data Mining tasks. The thesis studies how the K-Means algorithm is used and implemented to process huge amounts of data using Mahout's machine learning libraries in combination with the Hadoop Distributed File System and the MapReduce key-value pair paradigm. Additionally, a prototype was developed that uses a new parallel version of K-Means to solve the main problem of the thesis: performing parallel clustering on top of a distributed OLAP database.
That algorithm uses Mahout's libraries to achieve scalability, an OLAP database to store the data, and a new key-value pair concept to process the data in parallel. This concept is based on the MapReduce paradigm but works differently: the whole procedure has three phases (Map, Reduce, FinalReduce). First, during the Map phase, numerous threads are created that run different Map jobs in parallel, each producing its own key-value pairs. Then, in the Reduce phase, each thread continues with its corresponding Reduce job, processing the key-value pairs output by its own Map job without waiting for the other threads to finish their Map tasks. Finally, in the FinalReduce phase, the key-value pairs output by the Reduce phase of every thread are merged and the final key-value pairs are created. The thesis closes with a summary of preliminary tests and experimental results for these implementations, which verify that the developed prototype works properly and produces complete and reliable results. [en]
dc.corporate.name: Universidad Politecnica de Madrid. Escuela Tecnica Superior de Ingenieros Informaticos [el]
dc.contributor.master: Digital Systems and Services [el]
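
The Map, Reduce and FinalReduce phases described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which builds on Mahout's libraries and reads from an OLAP database; here the data are in-memory lists of points already split into partitions, and every function name (map_phase, reduce_phase, final_reduce, kmeans) is a hypothetical choice made for illustration. Python threads are used only to show the control flow, since they do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_phase(points, centroids):
    """Map: emit one (cluster_index, point) pair per point in this partition."""
    pairs = []
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        pairs.append((nearest, p))
    return pairs


def reduce_phase(pairs):
    """Reduce: fold one thread's own pairs into cluster_index -> (vector_sum, count)."""
    partial = {}
    for idx, p in pairs:
        if idx in partial:
            s, n = partial[idx]
            partial[idx] = ([a + b for a, b in zip(s, p)], n + 1)
        else:
            partial[idx] = (list(p), 1)
    return partial


def map_then_reduce(points, centroids):
    """Work done by one thread: its Reduce starts as soon as its own Map ends."""
    return reduce_phase(map_phase(points, centroids))


def final_reduce(partials, centroids):
    """FinalReduce: merge every thread's partial sums into new centroids."""
    totals = {}
    for partial in partials:
        for idx, (s, n) in partial.items():
            if idx in totals:
                ts, tn = totals[idx]
                totals[idx] = ([a + b for a, b in zip(ts, s)], tn + n)
            else:
                totals[idx] = (list(s), n)
    updated = []
    for i, c in enumerate(centroids):
        if i in totals:
            s, n = totals[i]
            updated.append(tuple(x / n for x in s))
        else:
            updated.append(tuple(c))  # empty cluster: keep the old centroid
    return updated


def kmeans(partitions, centroids, max_iterations=20):
    """Repeat Map-Reduce-FinalReduce until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            partials = list(pool.map(
                lambda part: map_then_reduce(part, centroids), partitions))
        updated = final_reduce(partials, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Calling kmeans(partitions, initial_centroids) with a non-empty list of partitions, each a list of equal-length numeric tuples, returns the final centroids as a list of tuples. Each thread emits only per-cluster sums and counts, so the FinalReduce step merges a small amount of data regardless of partition size.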



Except where otherwise noted, this item's license is described as
Attribution-NonCommercial-NoDerivatives 4.0 International

University of Piraeus Library
The creation and enrichment of the Institutional Repository "Dione" took place within the framework of the Project "Institutional Repository and Digital Library Service" of the action "Open-access digital services of the University of Piraeus library"