Efficient processing of Top-k joins in MapReduce / Hadoop

Σάουκ, Μέη; Saouk, Mei

Master Thesis

Author

Σάουκ, Μέη

Saouk, Mei

Date

2016-02-26

Abstract

Top-k joins are widely used in the area of data analytics. One of the most popular frameworks for data analytics is MapReduce, especially its open source implementation in Apache Hadoop. However, due to certain limitations of the model, the processing of top-k joins on Hadoop MapReduce becomes inefficient for very large datasets. In particular, MapReduce processes the whole input even if the best k tuples can be produced by processing only a part of the input datasets. In addition to this, MapReduce does not provide a load balancing technique for the fair load distribution to the reducers. These two weaknesses make top-k join processing on MapReduce inefficient. In this thesis, we propose three algorithms to tackle the problem of early termination and load balancing. Our techniques are based on algorithms that use data synopses such as histograms. Our experimental evaluation proves the efficiency of our proposed algorithms in terms of execution time and resources used, for a number of factors such as the k value, the dataset size, the join selectivity and the data distribution.

Postgraduate Studies Programme

Ψηφιακά Συστήματα και Υπηρεσίες

Department

Σχολή Τεχνολογιών Πληροφορικής και Επικοινωνιών. Τμήμα Ψηφιακών Συστημάτων

Number of pages

Language

English

URI

https://dione.lib.unipi.gr/xmlui/handle/unipi/9956

Collections

Τμήμα Ψηφιακών Συστημάτων

Show full item record

Except where otherwise noted, this item's license is described as
Attribution-NonCommercial-NoDerivatives 4.0 Διεθνές