Συσταδοποίηση ροών σύνθετων δεδομένων
Clustering complex data streams
Κοντονής, Βασίλειος Ν.
Clustering of streaming data is a continuously growing research area that spans big data mining and clustering and analysis algorithms, wherever high-volume data processing is required. The purpose of this thesis is to evaluate the capabilities that modern, widely known libraries (MOA) and tools (the R programming language) provide for applying clustering algorithms to high-volume streams of complex data. Using a dataset with a large number of records describing the trajectories of many taxis in Beijing, the main contribution of this work is the implementation of an architecture and application for that purpose. Applying it to this dataset revealed the potential that using the MOA library through the R programming environment offers to both the research and the commercial sector. DenStream, a density-based clustering algorithm, was applied, evaluated, and compared with the CluStream algorithm, which was also applied and evaluated as part of this thesis. DenStream gave better results than CluStream in terms of clustering quality, according to the evaluation metrics used. Furthermore, it turns out that the initialization of each algorithm's parameters is empirical: the analyst sets them based on expertise and domain knowledge. Finally, the thesis considers whether the proposed architecture could serve as a point of reference for, or a component of, a statistical analysis system that processes data streams, groups them into clusters, and evaluates the results in real time.