Skyline query processing in Spatial Hadoop

Master Thesis
Author
Περτέσης, Δημήτριος
Date
2014-10-21
Subject
Computer architecture ; Electronic data processing ; Parallel processing (Electronic computers) -- Hadoop MapReduce ; Apache Hadoop (Computer file) ; Graph theory ; Parallel programming (Computer science) ; Computers -- Programming -- Hadoop MapReduce ; Computers -- Programming
Abstract
The MapReduce programming model makes it possible to process large data sets on a cluster of machines. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. The most popular open-source implementation is Apache Hadoop. Recently, an extension to Apache Hadoop called SpatialHadoop has been developed. SpatialHadoop is designed to handle large spatial data sets. It provides built-in spatial data types and also lets users define their own. Moreover, it supports a variety of spatial operations and indexes.
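The map, sort, and reduce flow described above can be illustrated with a small in-memory sketch. This is not Hadoop code: `run_mapreduce`, `count_mapper`, `sum_reducer`, and the 10x10 grid cells are hypothetical names and choices made only for illustration, and everything runs in a single Python process.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_chunks=3):
    """Simulate one MapReduce job in memory (illustrative only)."""
    # Split the input into independent chunks, one per hypothetical map task.
    chunks = [records[i::num_chunks] for i in range(num_chunks)]
    # Map phase: each chunk is processed independently of the others.
    mapped = [pair for chunk in chunks for rec in chunk for pair in mapper(rec)]
    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reducer call per key, visiting keys in sorted order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def count_mapper(point):
    # Emit the grid cell (assumed 10x10 units) that contains the point.
    x, y = point
    yield (int(x // 10), int(y // 10)), 1


def sum_reducer(key, values):
    # Count how many points fell into this cell.
    return sum(values)


points = [(1, 2), (3, 4), (15, 2), (11, 9), (25, 31)]
cell_counts = run_mapreduce(points, count_mapper, sum_reducer)
print(cell_counts)
```

In a real cluster the chunks would live in a distributed file system and the map and reduce calls would run on different machines; the sketch only mirrors the order of the phases.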
In this project, we developed two efficient skyline computation algorithms and implemented them on SpatialHadoop. We also compared them with an algorithm proposed in the paper «CG_Hadoop: Computational Geometry in MapReduce». The objective of this study is to implement algorithms that are efficient on uniform, correlated, and anti-correlated data distributions. These algorithms should also be able to work with both indexed and non-indexed input files. To evaluate the efficiency of the three algorithms, we ran a set of experiments on a cluster of 17 nodes.
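For readers unfamiliar with skyline queries, the following is a minimal sketch of the general idea of computing a skyline over partitioned data: a local skyline per partition, then a merge of the local results. It is not one of the algorithms developed in the thesis, nor the CG_Hadoop algorithm. The function names, the sample partitions, and the convention that smaller values are better in every dimension are assumptions made for illustration.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    result = []
    # After a lexicographic sort, any point that dominates p precedes p,
    # so it is enough to compare p against the skyline found so far.
    for p in sorted(set(points)):
        if not any(dominates(q, p) for q in result):
            result.append(p)
    return result


def partitioned_skyline(partitions):
    """Compute a local skyline per partition, then merge the survivors."""
    # "Map" side: each partition is reduced to its local skyline.
    local = [skyline(part) for part in partitions]
    # "Reduce" side: the global skyline is the skyline of all local ones.
    merged = [p for part in local for p in part]
    return skyline(merged)


# Hypothetical two-dimensional input, already split into three partitions.
partitions = [
    [(1, 9), (2, 8), (3, 9)],
    [(4, 4), (5, 5), (6, 3)],
    [(9, 1), (8, 2), (9, 9)],
]
global_skyline = partitioned_skyline(partitions)
print(global_skyline)
```

The merge step is correct because a point dominated inside its own partition is also dominated globally, so discarding it early cannot remove a true skyline point. On anti-correlated data such as the sample above, few points are pruned locally, which is one reason the data distribution matters for efficiency.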