Deep neural networks on text-to-speech synthesis

Master Thesis
Author
Tsagkaratos, Panagiotis
Τσαγκαράτος, Παναγιώτης
Date
2022
Keywords
Μηχανική μάθηση ; Ανάλυση δεδομένων ; Machine learning ; Data analysis
Abstract
Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken
language. TTS systems play an important role in natural human-computer interaction.
Concatenative speech synthesis and statistical parametric speech synthesis were the
prominent methods used for decades. In the era of deep learning, TTS systems have
dramatically improved the quality of synthetic speech. The aim of this work was to
compare [1] with the latest developments in the field of TTS and to suggest
improvements. The neural network architecture of Tacotron-2 is used for speech synthesis
directly from text. The system is composed of a recurrent sequence-to-sequence feature
prediction network that maps character embeddings to acoustic features, followed by a
modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from
the predicted acoustic features. Developing TTS systems for any given language is a
significant challenge and requires a large amount of high-quality acoustic recordings.
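
The two-stage structure described above (text to acoustic features, then acoustic features to waveform) can be illustrated with a minimal sketch of the data flow. This is not the thesis implementation and contains no trained model: both stages are random placeholders, and the names text_to_ids, predict_features, vocode and synthesize, as well as the constants N_MELS, HOP_LENGTH and FRAMES_PER_CHAR, are assumptions chosen only to show the shapes passed between the stages.

```python
import random

# All constants below are illustrative assumptions, not values from the thesis.
N_MELS = 80          # acoustic feature channels per frame
HOP_LENGTH = 256     # waveform samples produced per acoustic frame
FRAMES_PER_CHAR = 5  # fixed rate; a real model learns the alignment instead


def text_to_ids(text, vocab):
    """Map each known character to an integer id (an embedding-table index)."""
    return [vocab[ch] for ch in text.lower() if ch in vocab]


def predict_features(char_ids, rng):
    """Placeholder for the sequence-to-sequence feature prediction network.

    Returns a list of frames, each a list of N_MELS random values.
    """
    n_frames = len(char_ids) * FRAMES_PER_CHAR
    return [[rng.random() for _ in range(N_MELS)] for _ in range(n_frames)]


def vocode(frames, rng):
    """Placeholder for the vocoder: HOP_LENGTH random samples per frame."""
    n_samples = len(frames) * HOP_LENGTH
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def synthesize(text, seed=0):
    """Run the two placeholder stages in sequence and return every stage's output."""
    vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz '.,?!")}
    rng = random.Random(seed)
    char_ids = text_to_ids(text, vocab)
    frames = predict_features(char_ids, rng)
    waveform = vocode(frames, rng)
    return char_ids, frames, waveform


if __name__ == "__main__":
    ids, frames, wav = synthesize("Hi there.")
    print(len(ids), "characters ->", len(frames), "frames ->", len(wav), "samples")
```

In a trained system the first placeholder would be replaced by the recurrent feature prediction network and the second by the neural vocoder. The interface between them, a sequence of fixed-width acoustic frames, is what stays the same.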