Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature

Cited by: 2
Authors
Du, Chenpeng [1 ]
Guo, Yiwei [1 ]
Chen, Xie [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, X-LANCE Lab, MoE Key Lab of Artificial Intelligence, Shanghai 200240, Peoples R China
Keywords
Speech synthesis; speaker adaptation; timbre normalization; vector quantization; pitch
DOI
10.1109/TASLP.2023.3308374
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods adapt only to the timbre of the target speakers and fail to capture their speaking styles from so little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ features, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. This decomposition enables us to closely mimic both characteristics of the target speaker when adapting with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. This embedding table later serves as a selectable codebook, or as a set of basis vectors, for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better multi-speaker text-to-speech performance than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
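
As a rough illustration of the feature pipeline the abstract describes (PCA dimensionality reduction, flow-based timbre normalization, then k-means quantization), the following Python sketch uses scikit-learn for the PCA and k-means steps. The normalizing-flow timbre normalizer is stubbed with an identity placeholder, and all variable names, feature dimensions, and cluster counts are illustrative assumptions, not values from the paper.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def timbre_normalize(feats):
    # Placeholder for the normalizing-flow model that the paper uses to
    # map speaker-dependent features into a timbre-normalized space.
    # A real implementation would apply a trained flow here (assumption).
    return feats

# Dummy stand-in for frame-level self-supervised VQ acoustic features,
# shape (frames, dims); both sizes are illustrative.
feats = np.random.randn(1000, 256).astype(np.float32)

reduced = PCA(n_components=64).fit_transform(feats)  # step 1: PCA reduction
normalized = timbre_normalize(reduced)               # step 2: timbre normalization
kmeans = KMeans(n_clusters=512, n_init=10)           # step 3: k-means re-quantization
tn_vq_tokens = kmeans.fit_predict(normalized)        # one discrete TN-VQ id per frame
print(tn_vq_tokens[:10])

In the full system, per the abstract, these discrete timbre-normalized tokens are what the acoustic model predicts from text (carrying speaking style), while timbre is reintroduced separately by the vocoder.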
Pages: 3446-3456
Page count: 11