A cascaded BERT model for sentiment classification
A multi-stage BERT model for sentiment analysis

Bachelor Dissertation
Author
Papadakis, Ioannis
Παπαδάκης, Ιωάννης
Date
2025-09
Keywords
BERT ; Sentiment analysis ; Emotion recognition ; Transformers ; Pre-training ; Fine-tuning ; From-scratch training ; Masked Language Modeling ; Hyperparameter optimization
Abstract
This thesis presents the development and evaluation of a BERT-based (Bidirectional
Encoder Representations from Transformers) model for multi-class emotion
recognition in text. The distinctive contribution of this work is the comprehensive
training of BERT from scratch—including both pre-training and fine-tuning phases—
rather than relying on publicly available pre-trained models. This approach provides
empirical insights into how language models acquire linguistic knowledge and learn
task-specific patterns from data.
The approach follows a two-stage process of pre-training and fine-tuning. In the pre-training stage, a full BERT-base model (12 Transformer layers, ~110M parameters) was trained from random initialization on the English Wikipedia dataset (version 20231101) using the Masked Language Modeling (MLM) objective. Subsequently, the model was fine-tuned for emotion classification on a merged dataset of about 32,000 samples drawn from the Twitter Multi-class Sentiment dataset and an academic Emotion dataset, collected as part of this work; the combined data covers six emotions: Joy, Sadness, Anger, Fear, Love, and Surprise.
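The pre-training stage can be outlined with the Hugging Face transformers and datasets libraries, as in the minimal sketch below. The dataset identifier wikimedia/wikipedia with configuration 20231101.en matches the stated snapshot version but is an assumption, and the sequence length, batch size, masking probability, and output directory are illustrative placeholders rather than the exact settings used in this work.

```python
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# BERT-base configuration: 12 Transformer layers, 768 hidden units, 12 heads (~110M parameters).
config = BertConfig(
    vocab_size=30522,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
)
model = BertForMaskedLM(config)  # randomly initialized, no pre-trained weights

# The standard BERT vocabulary is reused for tokenization only; the weights are not.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# English Wikipedia snapshot (assumed Hub identifier for the 20231101 version).
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)

# Dynamic masking for the MLM objective: 15% of tokens are masked per batch.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-wiki-mlm", per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```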
The work proceeded through four experimental phases, each yielding key lessons. The first phase was direct supervised training, in which a randomly initialized BERT model was fine-tuned for sentiment classification without any prior pre-training. This approach failed catastrophically on this dataset, reaching only 31% accuracy due to model collapse: the model always predicted the majority class (sadness). This failure provided solid empirical support for the principle that linguistic pre-training must precede task-specific specialization.
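Majority-class collapse of this kind can be detected by comparing the prediction distribution with a baseline that always outputs the most frequent label. The helper below is a hypothetical diagnostic sketch, not the evaluation code of this work.

```python
from collections import Counter
import numpy as np

def diagnose_collapse(predicted_labels, true_labels):
    """Report whether a classifier has degenerated into predicting one class."""
    pred_counts = Counter(predicted_labels)
    majority_label, majority_count = Counter(true_labels).most_common(1)[0]

    accuracy = float(np.mean(np.array(predicted_labels) == np.array(true_labels)))
    majority_baseline = majority_count / len(true_labels)

    print(f"Prediction distribution: {dict(pred_counts)}")
    print(f"Model accuracy:          {accuracy:.3f}")
    print(f"Majority-class baseline: {majority_baseline:.3f} (label: {majority_label})")

    # Collapse: (nearly) all predictions fall on a single label, so accuracy
    # cannot meaningfully exceed the majority-class baseline.
    return max(pred_counts.values()) / len(predicted_labels) > 0.95
```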
The second phase was a proof-of-concept validation of the two-stage approach, using a smaller model (a 6-layer BERT) and the WikiText-2 corpus in a CPU-only environment. This configuration reached about 89% accuracy, showing that the two-phase approach is valid and justifying its deployment at full scale.
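A compact sketch of the reduced proof-of-concept setup follows. Only the 6-layer depth and the WikiText-2 corpus are stated above; the hidden size, number of attention heads, and the Hugging Face dataset identifier wikitext / wikitext-2-raw-v1 are assumptions.

```python
from datasets import load_dataset
from transformers import BertConfig, BertForMaskedLM

# Scaled-down model for the CPU proof of concept: 6 Transformer layers.
small_config = BertConfig(
    num_hidden_layers=6,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,
)
small_model = BertForMaskedLM(small_config)

# WikiText-2 is small enough to pre-train on a CPU in reasonable time.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
```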
The third phase focused on building infrastructure and migrating to an NVIDIA L40S 48GB GPU server. This uncovered a number of technical challenges, including operating-system issues (requiring a move from an incompatible CentOS Stream 9 to Ubuntu 22.04 LTS), Python compilation problems, and library dependency conflicts. These were addressed systematically through the creation of a full suite of DevOps automation scripts for deployment, synchronization, and environment setup.
Full-scale training took place in the fourth and final phase: the complete 12-layer BERT model was pre-trained on the entire English Wikipedia dataset and then fine-tuned, with Optuna-based hyperparameter optimization run through the Hugging Face Trainer. The pre-training phase lasted about 2.5 hours on the NVIDIA L40S GPU. Fine-tuning was conducted on the merged Twitter-Emotion dataset, and an automatic hyperparameter search over 10 trials was performed to find the most suitable learning rate and batch size settings.
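Such a search can be wired up through Trainer.hyperparameter_search with the Optuna backend, as in the sketch below. Only the 10 trials and the learning-rate/batch-size search dimensions come from the description above; the checkpoint name bert-wiki-mlm, the dataset variables train_ds and val_ds, and the concrete search ranges are illustrative assumptions.

```python
import numpy as np
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # Re-created for every Optuna trial; "bert-wiki-mlm" is the hypothetical
    # output directory of the pre-training stage sketched earlier.
    return BertForSequenceClassification.from_pretrained("bert-wiki-mlm", num_labels=6)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

def hp_space(trial):
    # Search space limited to learning rate and batch size, as described above;
    # the ranges themselves are illustrative.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32, 64]
        ),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="bert-emotion"),
    train_dataset=train_ds,   # tokenized merged Twitter-Emotion training split (placeholder)
    eval_dataset=val_ds,      # validation split used as the search objective (placeholder)
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    compute_objective=lambda metrics: metrics["eval_accuracy"],
    backend="optuna",
    n_trials=10,
    direction="maximize",
)
print(best_run.hyperparameters)
```

Because the classifier is re-instantiated from the same pre-trained checkpoint via model_init, all trials start from identical weights and differ only in the sampled hyperparameters.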
The best model achieved an average accuracy of 91.3% on the test set and demonstrated strong results for joy (F1-score: 0.94), sadness (F1-score: 0.95), and anger (F1-score: 0.91). The fear (F1-score: 0.88), love (F1-score: 0.83), and surprise (F1-score: 0.72) classes exhibited relatively lower performance because of the smaller amount of training data in these categories and their significant semantic overlap with other emotion categories. The surprise category, the most under-represented in the training data (~3% of samples), was the hardest.
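Per-class scores of this kind can be produced with scikit-learn's classification_report, as in the short sketch below; test_labels and test_predictions are placeholders for the held-out gold labels and the model's predicted label indices.

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set.
emotions = ["joy", "sadness", "anger", "fear", "love", "surprise"]
print(classification_report(test_labels, test_predictions,
                            target_names=emotions, digits=2))
```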
This thesis demonstrates the importance of the two-phase method when training a language model from scratch, documents the technical problems solved while setting up GPU infrastructure, and shows the usefulness of the BERT architecture for emotion classification. It also highlights the value of systematic hyperparameter search and a methodical approach in deep learning projects. The project demonstrates that training a state-of-the-art language model from scratch is feasible, provided one follows a scientific mindset, has sufficient computational resources, and applies disciplined, large-scale DevOps practices.


