Automatic description of audiovisual scenes using deep neural networks

Keywords
CLIP; Video summary; Machine learning; Transformers

Abstract
Machine learning is a rapidly growing field of informatics, capable of providing solutions to demanding problems of increasing complexity. In that context, the goal of this thesis is to build a system for automatic video scene description using a machine learning pipeline. To this end, a video signal is treated as a sequence of images, and each image is fed as input to a CLIP architecture, which generates an image description. CLIP is an open-source embedding model trained to associate images with text. In the next step, the sequence of generated descriptions is concatenated and given as input to a transformer model, which produces the final description of the video scene. To obtain better results at this second processing stage, we re-trained and fine-tuned the BART and Pegasus transformer models on the Large Scale Movie Description Challenge (LSMDC) dataset. The performance of the proposed pipeline was assessed using several established metrics.
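
The two-stage flow described above can be illustrated with a short sketch. This is a minimal, hypothetical rendering of the pipeline, not the thesis's actual implementation: it assumes OpenCV for frame sampling and Hugging Face pipelines for the two model stages, with a publicly available image-captioning model standing in for the CLIP-based captioner and "facebook/bart-large-cnn" standing in for the fine-tuned BART; the video path and sampling rate are likewise assumptions.

```python
import cv2
from PIL import Image
from transformers import pipeline

def sample_frames(video_path: str, every_n: int = 30):
    """Treat the video as a sequence of images, keeping every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV yields BGR arrays; convert to RGB PIL images for the captioner.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

# Stage 1: generate one caption per sampled frame
# (stand-in model for the thesis's CLIP-based captioner).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
frames = sample_frames("scene.mp4")
captions = [captioner(f)[0]["generated_text"] for f in frames]

# Stage 2: concatenate the frame captions and condense them
# into a single scene description with a summarization transformer.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
description = summarizer(" ".join(captions), max_length=60, min_length=10)
print(description[0]["summary_text"])
```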
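
The fine-tuning of the second stage could follow the usual sequence-to-sequence recipe sketched below. The toy (captions, description) pair only mimics LSMDC-style supervision, and the model checkpoint, hyperparameters, and "bart-scene" output directory are illustrative assumptions rather than the thesis's training configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Toy stand-in for LSMDC-style pairs:
# concatenated frame captions -> one reference scene description.
toy = Dataset.from_dict({
    "captions": ["a man walks a dog. a man throws a ball. the dog runs."],
    "description": ["A man plays fetch with his dog in a park."],
})

def preprocess(example):
    # Tokenize the concatenated captions as the input and the
    # reference scene description as the target labels.
    enc = tok(example["captions"], truncation=True, max_length=512)
    enc["labels"] = tok(text_target=example["description"],
                        truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = toy.map(preprocess, remove_columns=toy.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-scene",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```

The same recipe would apply to Pegasus by swapping in a Pegasus checkpoint, since both models share the Hugging Face sequence-to-sequence interface.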