Automatic description of audiovisual scenes using deep neural networks

Keywords
CLIP; Video summary; Machine learning; Transformers

Abstract
Machine learning is a rapidly growing field of informatics, capable of providing solutions to demanding problems of increasing complexity. In that context, the goal of this thesis is to build a system for automatic video scene description using a machine learning pipeline. To this end, a video signal is treated as a sequence of images, and each image is fed as input to a CLIP architecture, which generates an image description. CLIP is an open-source embedding model trained to associate images with text. In the next step, the sequence of generated descriptions is concatenated and given as input to a transformer model, which produces the final description of the video scene. To obtain better results at this second processing stage, we re-trained and fine-tuned the BART and Pegasus transformer models on the Large Scale Movie Description Challenge (LSMDC) dataset. The performance of the proposed pipeline was assessed using several established metrics.
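
The two-stage flow described above can be illustrated with a short sketch. This is a minimal, hypothetical rendering of the pipeline, not the thesis's actual implementation: it assumes OpenCV for frame sampling and Hugging Face pipelines for the two model stages, with a publicly available image-captioning model standing in for the CLIP-based captioner and "facebook/bart-large-cnn" standing in for the fine-tuned BART; the video path and sampling rate are likewise assumptions.

```python
import cv2
from PIL import Image
from transformers import pipeline

def sample_frames(video_path: str, every_n: int = 30):
    """Treat the video as a sequence of images, keeping every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV yields BGR arrays; convert to RGB PIL images for the captioner.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

# Stage 1: generate one caption per sampled frame
# (stand-in model for the thesis's CLIP-based captioner).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
frames = sample_frames("scene.mp4")
captions = [captioner(f)[0]["generated_text"] for f in frames]

# Stage 2: concatenate the frame captions and condense them
# into a single scene description with a summarization transformer.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
description = summarizer(" ".join(captions), max_length=60, min_length=10)
print(description[0]["summary_text"])
```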
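
The fine-tuning of the second stage could follow the usual sequence-to-sequence recipe sketched below. The toy (captions, description) pair only mimics LSMDC-style supervision, and the model checkpoint, hyperparameters, and "bart-scene" output directory are illustrative assumptions rather than the thesis's training configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Toy stand-in for LSMDC-style pairs:
# concatenated frame captions -> one reference scene description.
toy = Dataset.from_dict({
    "captions": ["a man walks a dog. a man throws a ball. the dog runs."],
    "description": ["A man plays fetch with his dog in a park."],
})

def preprocess(example):
    # Tokenize the concatenated captions as the input and the
    # reference scene description as the target labels.
    enc = tok(example["captions"], truncation=True, max_length=512)
    enc["labels"] = tok(text_target=example["description"],
                        truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = toy.map(preprocess, remove_columns=toy.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-scene",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```

The same recipe would apply to Pegasus by swapping in a Pegasus checkpoint, since both models share the Hugging Face sequence-to-sequence interface.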