Efficient processing of ranked queries in Map/Reduce

Abstract
This thesis aims to process ranked queries (also known as Top-K queries) efficiently using the Map/Reduce programming model. In applications that manage huge volumes of data, computations such as Top-K queries must be executed in a distributed and parallel fashion in order to be fast and efficient. To achieve this, the Hadoop system and the Map/Reduce programming model were used in distributed environments. The major advantages of Hadoop for developing distributed applications are parallel data processing across the nodes of a cluster and the ability to handle machine failures: the system detects tasks that have failed and reassigns them to other nodes of the cluster. Reliability is thus ensured at the software level and does not depend on the quality of the hardware. However, the most important shortcoming of Map/Reduce for ranked (Top-K) queries is that it must read the entire input in order to produce the final result, which is inefficient when only K results are needed. Through an experimental evaluation of three different algorithms, this study aims to show the disadvantages of the default operation of the Map/Reduce programming model for Top-K queries, and to present a recommended solution for processing such queries effectively. Two major issues are addressed, namely early termination and load balancing.
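To make the baseline behavior concrete, here is a minimal sketch in plain Python of a Top-K computation in the Map/Reduce style. It is not Hadoop code and not one of the three algorithms evaluated in the thesis. The record layout `(id, score)` and the function names `map_local_top_k`, `reduce_global_top_k`, and `top_k` are assumptions made for illustration only. Each map task scans its whole input split and keeps its K best records, and a single reduce step merges those candidates, so every record is read even though only K are returned.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record layout: (record id, score). Higher score ranks first.
Record = Tuple[str, float]


def map_local_top_k(split: Iterable[Record], k: int) -> List[Record]:
    """Map phase: scan one input split in full and keep its k best records."""
    return heapq.nlargest(k, split, key=lambda rec: rec[1])


def reduce_global_top_k(partials: Iterable[List[Record]], k: int) -> List[Record]:
    """Reduce phase: merge the per-split candidates into the global top-k."""
    merged = (rec for part in partials for rec in part)
    return heapq.nlargest(k, merged, key=lambda rec: rec[1])


def top_k(splits: List[List[Record]], k: int) -> List[Record]:
    """Run the map phase on every split, then a single reduce step."""
    partials = [map_local_top_k(split, k) for split in splits]
    return reduce_global_top_k(partials, k)
```

In this sketch no map task can stop before reaching the end of its split, and a split that is much larger than the others dominates the running time. These are the two issues, early termination and load balancing, that the abstract names.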