Automatic music captioning

Rentoula, Vasiliki; Ρέντουλα, Βασιλική

Αυτόματη περιγραφή μουσικής

Master Thesis

Author

Rentoula, Vasiliki

Ρέντουλα, Βασιλική

Date

2025-05

Abstract

This work applies deep learning techniques to automatic audio captioning, with a particular focus on music. Specifically, the study reproduces and benchmarks state-of-the-art music captioning models that integrate sequence-to-sequence models, following insights from the DCASE 2023 Task 6A challenge. It also investigates self-supervised learning with convolutional and transformer-based autoencoders, in which masked audio representations, pretrained by predicting missing parts of audio signals, are transferred to the captioning model. To further improve model performance, several masking strategies (unstructured, time, frequency, and combined time-frequency masking) were explored to evaluate their impact on caption quality. The study also examines the role of music tagging, evaluating how genre and instrument labels affect caption generation. Through a comparative analysis of training configurations, the effectiveness of pretrained versus randomly initialized encoders is assessed on multiple datasets. By addressing these objectives, this research aims to contribute to the development of improved music captions. The code is available at https://github.com/CuteQuacky/Thesis_Music_Captioning
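
The four masking strategies named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `make_mask` and `apply_mask`, the `ratio` and `seed` parameters, and the choice to fill masked cells with zeros are all assumptions made for illustration. The spectrogram is treated as a plain grid of frequency bins (rows) by time frames (columns).

```python
import random


def make_mask(n_freq, n_time, strategy, ratio=0.5, seed=0):
    """Return an n_freq x n_time grid of booleans; True marks a masked cell.

    Hypothetical helper: `ratio` is the fraction of cells, frames, or bins to hide.
    """
    strategies = ("unstructured", "time", "frequency", "time_frequency")
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy!r}")
    rng = random.Random(seed)
    mask = [[False] * n_time for _ in range(n_freq)]
    if strategy == "unstructured":
        # Hide individual cells chosen uniformly at random.
        cells = [(f, t) for f in range(n_freq) for t in range(n_time)]
        for f, t in rng.sample(cells, int(ratio * len(cells))):
            mask[f][t] = True
    if strategy in ("time", "time_frequency"):
        # Hide whole time frames (columns).
        for t in rng.sample(range(n_time), int(ratio * n_time)):
            for f in range(n_freq):
                mask[f][t] = True
    if strategy in ("frequency", "time_frequency"):
        # Hide whole frequency bins (rows).
        for f in rng.sample(range(n_freq), int(ratio * n_freq)):
            mask[f] = [True] * n_time
    return mask


def apply_mask(spectrogram, mask, fill=0.0):
    """Replace masked cells with `fill` (zero fill is an assumption)."""
    return [
        [fill if hidden else value for value, hidden in zip(row, mask_row)]
        for row, mask_row in zip(spectrogram, mask)
    ]


# Usage: hide 2 of the 8 time frames of a toy 4 x 8 spectrogram.
spec = [[1.0] * 8 for _ in range(4)]
masked = apply_mask(spec, make_mask(4, 8, "time", ratio=0.25))
```

In a masked-autoencoder setup of the kind the abstract describes, the encoder would see the masked input and be trained to reconstruct the hidden cells; that training loop is outside the scope of this sketch.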

Postgraduate Studies Programme

Τεχνητή Νοημοσύνη - Artificial Intelligence

Department

School of Information and Communication Technologies. Department of Digital Systems

Corporate Department

National Center of Scientific Research "Demokritos"

Number of pages

Language

English

URI

https://dione.lib.unipi.gr/xmlui/handle/unipi/17816

Collections

Department of Digital Systems

Except where otherwise noted, this item's license is described as
Attribution-NonCommercial-NoDerivs 3.0 Greece