A comparative analysis of data parallelism and model parallelism for deep learning-based text classification

Bachelor Dissertation
Author
Giagias, Dimitrios
Date
2025-09
Advisor
Venetis, Ioannis
Keywords
Deep learning; Parallel computing; Data parallelism; Model parallelism; Text classification; Distributed training; Natural Language Processing (NLP)
Abstract
The increasing computational demands of modern deep learning models for natural language processing have made parallel training strategies essential in practice. While the theoretical foundations of parallel training are well established, empirical comparisons of different approaches applied to text classification architectures remain limited. This thesis presents an experimental comparison of data parallelism and model parallelism across four representative deep learning architectures: Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), and Transformers.
An experimental framework was developed using PyTorch to evaluate three training strategies: sequential baseline training, data parallelism using DistributedDataParallel, and model parallelism through manual layer partitioning across two GPUs. All experiments were conducted on the AG News dataset, ensuring standardized evaluation conditions across different architectures and parallelization approaches. The experimental evaluation focused on two critical aspects: computational efficiency measured through training time and speedup analysis, and model quality assessed through standard classification metrics including accuracy, precision, recall, and F1-score.
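To illustrate the two strategies evaluated in the framework, the sketch below shows (a) manual layer partitioning of a toy text classifier across two GPUs and (b) wrapping a single-device model in PyTorch's DistributedDataParallel. This is a minimal sketch, not the thesis code: the TwoGPUClassifier class, the ddp_setup helper, and all layer sizes are illustrative assumptions.

# Minimal sketch of the two parallelization strategies on a toy classifier.
import torch
import torch.nn as nn

# --- Model parallelism: manual layer partitioning across two GPUs ---
class TwoGPUClassifier(nn.Module):  # hypothetical example model
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256, num_classes=4):
        super().__init__()
        # First part of the network is placed on GPU 0 ...
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim).to("cuda:0")
        self.encoder = nn.Linear(embed_dim, hidden_dim).to("cuda:0")
        # ... and the classifier head on GPU 1.
        self.head = nn.Linear(hidden_dim, num_classes).to("cuda:1")

    def forward(self, tokens):
        x = torch.relu(self.encoder(self.embed(tokens.to("cuda:0"))))
        # Activations cross the device boundary here; this transfer is the
        # inter-GPU communication cost discussed in the results.
        return self.head(x.to("cuda:1"))

# --- Data parallelism: one full model replica per GPU via DDP ---
# Launched with e.g.: torchrun --nproc_per_node=2 train.py
def ddp_setup(single_device_model):  # hypothetical helper
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    dist.init_process_group(backend="nccl")  # rank/world size come from torchrun env vars
    rank = dist.get_rank()
    model = single_device_model.to(rank)
    # DDP keeps a full replica on each GPU and all-reduces gradients every step.
    return DDP(model, device_ids=[rank])

In the data-parallel path, the training DataLoader would typically use a DistributedSampler so that each process sees a distinct shard of the dataset in every epoch.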
Results demonstrated that Data Parallelism consistently delivered substantial speedups (1.30×–1.80×) across all architectures while maintaining or improving model accuracy, including a notable +4.58% gain for the GRU attributable to reduced overfitting. In contrast, Model Parallelism provided only modest acceleration (1.04×–1.12×) and was highly sensitive to hardware topology, with performance degrading when inter-GPU communication costs dominated.
The findings lead to a clear conclusion: Data Parallelism is the preferred strategy when models fit within a single device, offering strong throughput gains with minimal implementation cost, whereas Model Parallelism remains valuable primarily as a memory-scaling tool for architectures that exceed single-GPU capacity.


