Sign language recognition in video sequences of single words

Master's Thesis
Author
Nikolopoulos, Konstantinos
Νικολόπουλος, Κωνσταντίνος
Date
2024-02
Keywords
SLR ; CNN ; Gesture recognition ; Sign language
Abstract
The goal of this thesis is to explore the challenges of the Sign Language Recognition (SLR) problem and to propose an accurate Machine Learning (ML) model for SLR in video sequences of single words. SLR holds significant importance as it addresses the communication barriers between individuals with hearing impairments or speech impediments and the general population. However, existing methods face various constraints: many proposed solutions rely on image-based recognition, while others require multi-colored or sensor-based gloves or specialized cameras. This study proposes a straightforward system that does not require special accessories, yet remains highly resilient to variations among test subjects such as skin tone, gender, and body size. This signer-independent system consists of four main steps. First, a dataset was gathered for three target corpus sizes (20, 100 and 300 words) that is both balanced and highly variable; for this reason, the "WLASL: A large-scale dataset for Word-Level American Sign Language" was selected. Then, arm and hand features were extracted from the videos using real-time optimized Computer Vision libraries, frameworks and ML solutions, chiefly MediaPipe and OpenCV. Afterwards, data augmentation and dynamic time warping techniques were applied to the data to improve performance and invariance. Finally, a selection of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and combinations of the two were trained. The experiments showed that the proposed approach yields excellent results, especially for the CNN models, reaching up to 98% accuracy for a corpus size of 100 words and 97% for a corpus size of 300 words.
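The abstract mentions applying dynamic time warping to align feature sequences of different lengths. As a minimal sketch of that idea (the function name and the use of per-frame landmark vectors are illustrative assumptions, not the thesis's actual implementation), classic DTW can be written as:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: arrays of shape (num_frames, dim), e.g. per-frame hand
    landmark vectors of two sign videos with different durations.
    Returns the minimal cumulative Euclidean alignment cost.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch sequence b
                                 cost[i, j - 1],      # stretch sequence a
                                 cost[i - 1, j - 1])  # match both frames
    return cost[n, m]

# A repeated frame aligns at zero extra cost, so the same sign
# performed at different speeds yields a small DTW distance.
fast = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [1.0], [1.0], [2.0]])
print(dtw_distance(fast, slow))  # → 0.0
```

This warping-invariance is what makes DTW useful here: two signers rarely perform a word at the same speed, and DTW compares the gesture trajectories rather than frame indices.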