Σύγκριση των τεχνικών καθορισμού του πλήθους ομάδων σε σύνολα πολυδιάστατων δεδομένων
Comparison of techniques identifying the number of clusters present in multivariate datasets
Keywords
Clustering; Algorithm evaluation criteria
Abstract
Cluster analysis is a fundamental technique in data science that aims to uncover inherent patterns and relationships within complex datasets. This MSc thesis investigates and compares several criteria for evaluating clustering results on multidimensional datasets in order to identify the optimal number of clusters. Simulated data with known cluster structures are used to assess the stability and effectiveness of each method. Criteria such as the Silhouette measure and the Calinski-Harabasz index are used to compare candidate solutions and suggest the optimal number of clusters. The findings of our numerical experiments highlight the sensitivity of clustering outcomes to the choice of method and underline the importance of selecting techniques appropriate to the characteristics of the data.
The thesis offers insights into suggesting and selecting the optimal number of clusters, highlights the differing characteristics of the criteria, and draws conclusions on the subject through examples. Finally, we offer guidance for method selection and validation. Directions for future research are suggested, including the exploration of hybrid approaches and the challenges of large-scale data clustering.
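
The kind of comparison described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the thesis's own code or experimental setup. The simulated data (three well-separated two-dimensional Gaussian clusters), the use of k-means as the clustering algorithm, the candidate range k = 2 to 6, and all function names (`simulate_blobs`, `kmeans`, `silhouette`, `calinski_harabasz`) are assumptions made here for illustration. The sketch scores each candidate k with the mean silhouette width and the Calinski-Harabasz index and reports the k that each criterion prefers.

```python
import math
import random


def simulate_blobs(centers, n_per_cluster, spread, seed):
    """Simulate 2-D Gaussian clusters around the given centers (hypothetical setup)."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per_cluster)]


def mean_point(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def group_by_label(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j, members in group_by_label(points, labels).items():
            centers[j] = mean_point(members)  # empty clusters keep their old center
    return labels


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each divided by its degrees of freedom."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = mean_point(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean_point(members)
        between += len(members) * math.dist(center, overall) ** 2
        within += sum(math.dist(p, center) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)  # nearest other cluster
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Assumed simulation settings: three well-separated clusters, so the true k is 3.
data = simulate_blobs([(0, 0), (10, 0), (0, 10)], n_per_cluster=40, spread=0.5, seed=0)

scores = {}
for k in range(2, 7):
    labels = kmeans(data, k)
    scores[k] = (silhouette(data, labels), calinski_harabasz(data, labels))

best_k_silhouette = max(scores, key=lambda k: scores[k][0])
best_k_ch = max(scores, key=lambda k: scores[k][1])
for k, (sil, ch) in scores.items():
    print(f"k={k}: silhouette={sil:.3f}, Calinski-Harabasz={ch:.1f}")
print("suggested k:", best_k_silhouette, "(silhouette),", best_k_ch, "(Calinski-Harabasz)")
```

On data this cleanly separated both criteria should agree on the true number of clusters. Sensitivity of the kind the abstract reports would be expected to show up with overlapping, unbalanced, or non-spherical clusters. In practice one would more likely rely on an established library, for example scikit-learn's `silhouette_score` and `calinski_harabasz_score`, than on hand-written implementations.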