• Media type: Electronic Thesis; Text; E-Book
  • Title: Analyse des Représentations Latentes des Modèles de Text-To-Speech Neuronaux pour le Contrôle de la Synthèse Audio-Visuelle Expressive ; Analysis of Latent Representations of Neural Text-To-Speech Models for Expressive Audio-Visual Synthesis
  • Contributor: Lenglet, Martin [Author]
  • Imprint: theses.fr, 2023-12-12
  • Language: English
  • Keywords: Explainable AI ; Speech synthesis ; Réseau de Neurones Profond ; Expressive speech ; Parole expressive ; Conversational agent ; Expliquabilité IA ; Synthèse vocale ; Deep Neural Network ; Agent conversationnel
  • Origination:
  • Footnote: This data source also contains holdings records that do not lead to a full text.
  • Description: In recent years, deep neural architectures have achieved groundbreaking performance in various speech processing areas, including Text-To-Speech (TTS). Models have grown larger, with more layers and millions of trainable parameters, to achieve near-natural synthesis, at the expense of the interpretability of the intermediate representations they compute, called embeddings. However, the statistical learning performed by these neural models constitutes a valuable source of information about language. This presentation aims at opening this "black box" to explore the intermediate embeddings computed by state-of-the-art TTS models. By identifying phonetic and acoustic features in model representations, the proposed methods help explain how neural TTS models organize speech information in an unsupervised manner and provide new insights into the phonetic regularities captured by statistical learning on massive data, beyond human expertise. This work opens the route toward designing more careful control architectures for neural TTS, without the need for additional data or training. These results led us to propose an auxiliary module for expressive synthesis called Local Style Tokens (LST), which models local variations in prosody with respect to the type of embeddings to bias.
  • Access State: Open Access