Monitoring and mining distributed data streams
Παρακολούθηση και εξόρυξη γνώσης από κατανεμημένα ρεύματα δεδομένων

Abstract
Many modern streaming applications, such as the online analysis of financial, network, sensor and other forms of data, are inherently distributed. Because data is produced at distributed sites in these scenarios, the major challenge confronting algorithms that process it is to reduce communication, since collecting the data centrally is not feasible in large-scale applications.
An important query type in such applications involves continuously checking the position of a given (arbitrarily complex) function f with respect to a posed threshold T. This monitoring requirement may be explicitly placed at the core of an application's mission or implicitly stand as an operational component.
One approach to achieving the desired communication reduction is to decompose the monitoring problem into local constraints that can be disseminated to the geographically dispersed sites. Under this approach, each site in the network consults these constraints whenever its local dataset is altered. Collecting the data centrally is required only when the local constraint of at least one site is violated [103].
However, the decomposition of the central monitoring problem into a set of local constraints is not always effective. In fact, it may complicate the monitoring process and uncontrollably sacrifice accuracy when operating over generic network infrastructures where message losses and the death or reorganization of nodes affect the network formation.
A second approach to monitoring is to allow continuous communication between the necessary network parties while attempting to reduce bandwidth consumption by applying reduction techniques to the data under transmission. In this way, answers are derived efficiently by controllably compromising accuracy.
Regarding the first approach, we focus on monitoring complex (non-linear) functions over distributed data streams.
More precisely, in our work [42], we generalize the geometric monitoring approach initially presented in [103] by proposing the adoption of local predictors [22] to be used during distributed tracking. As regards the second approach, we propose an outlier detection framework, namely TACO [44, 45], that trades bandwidth for accuracy in a straightforward manner and supports various similarity metrics (monitored functions of interest).
Finally, we elaborate on extensions of the rationales used in these approaches. We concentrate on trajectory data streams and perform distributed Representative Trajectory monitoring over a number of monitored objects, utilizing the concept of predictors [42]. Additionally, we exploit the properties of the monitored similarity measures used in [44, 45] in the context of detecting movement pattern alterations over streaming movement data [116].
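As an illustration of the local-constraint idea described in this abstract, the following is a minimal sketch for the simplest possible case: f is the sum of one scalar value per site, and the threshold T is split evenly into local limits. This is not the geometric method of [103] nor the predictor-based scheme of [42]; the class names (Site, Coordinator), the even split T/n and the absence of any constraint rebalancing after a synchronization are assumptions made purely for illustration.

```python
from typing import List, Optional


class Site:
    """A remote site holding one scalar value and a local limit (hypothetical)."""

    def __init__(self, local_limit: float) -> None:
        self.value = 0.0
        self.local_limit = local_limit

    def update(self, delta: float) -> bool:
        """Apply a local update; return True if the local constraint is violated."""
        self.value += delta
        return self.value > self.local_limit


class Coordinator:
    """Monitors whether sum(site values) > T, contacting sites only on violations."""

    def __init__(self, n_sites: int, T: float) -> None:
        self.T = T
        # Assumption: even split. If every value <= T / n, the sum is <= T,
        # so no communication is needed while all local constraints hold.
        self.sites: List[Site] = [Site(T / n_sites) for _ in range(n_sites)]
        self.syncs = 0  # number of central data collections performed

    def observe(self, site_id: int, delta: float) -> Optional[bool]:
        """Feed an update to one site.

        Returns None if no communication was needed, otherwise the result
        of the central check sum(values) > T.
        """
        if not self.sites[site_id].update(delta):
            return None
        self.syncs += 1
        return sum(s.value for s in self.sites) > self.T


coordinator = Coordinator(n_sites=4, T=100.0)
for site_id, delta in [(0, 10.0), (1, 20.0), (2, 30.0), (3, 24.0), (0, 20.0)]:
    print(site_id, delta, coordinator.observe(site_id, delta))
```

In this sketch only updates that push a site past its local limit trigger a central collection; a practical scheme would also readjust the local constraints after each synchronization, which is omitted here.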