Clustering and classification of short text data

Μπάνος, Ιωάννης

Supervised and unsupervised learning for short text data

Master Thesis

Author

Μπάνος, Ιωάννης

Date

2026-03

Abstract

This thesis examines the classification and clustering of short text data, applied to patient reviews of pharmaceutical treatments. The study aims to analyze the sentiment of the reviews and to detect underlying patterns associated with patient experience and satisfaction. It first presents the theoretical background of supervised and unsupervised machine learning, along with the main methods of data mining and natural language processing (NLP). The dataset then undergoes extensive text preprocessing and feature extraction via Bag-of-Words (BoW), TF-IDF and Word2Vec, so that the algorithms receive suitable numerical input. For supervised learning, the classifiers Naive Bayes, Logistic Regression, Ridge, LinearSVC and SGD are applied. For unsupervised learning, K-means and HDBSCAN are used for clustering, and PCA and UMAP for dimensionality reduction and visualization. In both settings, external and internal evaluation metrics are analyzed to allow a more meaningful comparison of the algorithms. The results highlight the themes that dominate the reviews and confirm that certain classification and clustering methods achieve higher accuracy and consistency than others. The thesis concludes that combining NLP techniques with machine learning models is an effective tool for understanding complex patterns in short text data.
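As an illustration of the BoW and TF-IDF feature extraction step named in the abstract, the following is a minimal sketch using only the Python standard library. It is not the thesis's implementation: the tokenizer, the smoothed IDF formula (ln((1 + N) / (1 + df)) + 1, followed by L2 normalization), the function names and the three sample reviews are all assumptions made for the example.

```python
import math
from collections import Counter

PUNCT = ".,!?;:()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(PUNCT) for w in text.lower().split())
    return [t for t in tokens if t]

def bag_of_words(docs):
    """Return one Counter of raw term counts per document (BoW)."""
    return [Counter(tokenize(d)) for d in docs]

def tf_idf(docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1).
    """
    counts = bag_of_words(docs)
    n_docs = len(counts)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors

# Hypothetical reviews, invented for this example.
reviews = [
    "This drug worked well, no side effects.",
    "Terrible side effects, the drug did not help.",
    "Worked well for my headaches.",
]
vectors = tf_idf(reviews)
```

In this sketch, a term that appears in only one review (such as "headaches") receives a higher weight than a term shared across reviews (such as "worked"), which is the property that makes TF-IDF vectors useful as input to classifiers and clustering algorithms.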

Postgraduate Studies Programme

Applied Statistics

Department

School of Finance and Statistics. Department of Statistics and Insurance Science

Number of pages

Language

Greek

URI

https://dione.lib.unipi.gr/xmlui/handle/unipi/19119

Collections

Department of Statistics and Insurance Science

Except where otherwise noted, this item's license is described as
Attribution-NonCommercial-NoDerivatives 3.0 Greece