Fake news detection in a headline-only setting : a comparative study of machine learning and deep learning models on the GossipCop dataset
Ανίχνευση ψευδών ειδήσεων από τίτλους : συγκριτική μελέτη μοντέλων μηχανικής και βαθιάς μάθησης στο σύνολο δεδομένων GossipCop
Master Thesis
Author
Sotiropoulos, Dionysios
Σωτηρόπουλος, Διονύσιος
Date
2026-04
Supervisor
Filippakis, Michael
Φιλιππάκης, Μιχαήλ
Keywords
Fake news detection ; Headline-based classification ; Machine learning ; Deep learning ; Natural language processing ; NLP ; Transformer models ; Hybrid models ; Logistic regression ; Linear SVM ; Random forest ; XGBoost ; CNN ; BiLSTM ; DistilBERT ; Supervised binary classification ; GossipCop dataset ; FakeNewsNet
Abstract
The rapid spread of misinformation through digital platforms has made fake news detection an important problem in contemporary data-driven research. This thesis investigates the effectiveness of Machine Learning and Deep Learning approaches for fake news detection in a constrained headline-only setting, where the available textual information is limited and the dataset is significantly imbalanced. The study is based on the GossipCop subset of FakeNewsNet and focuses exclusively on headline text in order to evaluate model behavior under controlled content-based conditions.
A comparative experimental framework was developed including four Machine Learning models, namely Logistic Regression, Linear Support Vector Classification, Random Forest, and XGBoost, as well as four Deep Learning approaches: Convolutional Neural Networks, Bidirectional Long Short-Term Memory networks, DistilBERT, and a hybrid DistilBERT + XGBoost model. For the Machine Learning models, headlines were represented using 300-dimensional Doc2Vec embeddings and evaluated with stratified 10-fold cross-validation, while class imbalance was handled through SMOTE applied only to the training folds. For the Deep Learning models, tokenized or transformer-based headline representations were used within a holdout evaluation framework, with class weighting employed where appropriate.
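The point of applying SMOTE only to the training folds is that the held-out fold then contains real headlines exclusively, so no synthetic sample can leak into evaluation. The following is a minimal standard-library sketch of that protocol, not the thesis implementation. It makes several assumptions: the 4-dimensional random vectors stand in for the 300-dimensional Doc2Vec embeddings, `smote_like` is a simplified interpolation-based oversampler rather than a library SMOTE, and all function names and the toy class ratio are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign indices to folds so each fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


def smote_like(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        neighbours = sorted(
            (p for p in X_min if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


def cv_with_train_only_oversampling(X, y, n_splits=10, minority=1):
    """Yield (X_train, y_train, X_test, y_test); only the training part
    is oversampled, the test fold is returned untouched."""
    for test_idx in stratified_folds(y, n_splits):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_min = [x for x, label in zip(X_train, y_train) if label == minority]
        # Number of synthetic points needed to balance the two classes.
        n_new = (len(y_train) - len(X_min)) - len(X_min)
        if n_new > 0 and len(X_min) >= 2:
            new_points = smote_like(X_min, n_new)
            X_train = X_train + new_points
            y_train = y_train + [minority] * len(new_points)
        yield X_train, y_train, [X[i] for i in test_idx], [y[i] for i in test_idx]


# Toy data (hypothetical): 40 "fake" (1) and 160 "real" (0) headline vectors.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1] * 40 + [0] * 160
splits = list(cv_with_train_only_oversampling(X, y))
```

In each of the ten splits the training portion is balanced after oversampling, while the test fold keeps the original class ratio and contains only original vectors. In practice one would more likely rely on an established SMOTE implementation inside a cross-validation pipeline; the sketch only makes the fold boundary explicit.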
The results show a clear performance gap between the two model families. The Machine Learning baselines, especially Logistic Regression and LinearSVC, exhibited weak discriminative performance, while Random Forest and XGBoost improved overall accuracy but remained ineffective in recovering fake news instances. In contrast, the Deep Learning models achieved substantially stronger and more balanced results. DistilBERT provided the best overall balance across ROC-AUC, Cohen’s Kappa, fake-class F1-score, and weighted F1-score. The hybrid DistilBERT + XGBoost model achieved the highest overall accuracy, although at the cost of lower fake recall, whereas BiLSTM demonstrated the strongest ability to recover fake news instances.
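The contrast between overall accuracy and fake-class recall can be made concrete with a small calculation. The sketch below computes accuracy, fake-class recall, fake-class F1, and Cohen's Kappa from a binary confusion matrix. The two sets of counts are invented for illustration and are not results from the thesis; the function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Metrics for the positive ("fake") class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_fake = ((tp + fp) / total) * ((tp + fn) / total)
    p_real = ((tn + fn) / total) * ((tn + fp) / total)
    p_chance = p_fake + p_real
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "fake_recall": recall,
            "fake_f1": f1, "kappa": kappa}


# Invented counts for 1000 headlines, 250 of them fake.
# Model A rarely predicts "fake"; model B predicts it more readily.
model_a = binary_metrics(tp=50, fp=20, fn=200, tn=730)
model_b = binary_metrics(tp=175, fp=150, fn=75, tn=600)

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

With these counts, model A has slightly higher accuracy than model B (0.780 versus 0.775) yet recovers far fewer fake headlines (recall 0.2 versus 0.7) and has a lower Kappa, which is why accuracy alone is a weak guide on imbalanced data.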
The findings indicate that transformer-based and other Deep Learning approaches are more suitable than traditional Machine Learning methods for headline-based fake news detection. At the same time, the study highlights the strong influence of class imbalance and the inherent limitations of headline-only datasets, which restrict contextual depth and constrain model performance. Overall, the thesis shows that fake news detection based solely on headlines remains a relevant but inherently difficult classification problem.


