Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature

Cited by: 2
Authors
Du, Chenpeng [1 ]
Guo, Yiwei [1 ]
Chen, Xie [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, X-LANCE Lab, MoE Key Lab of Artificial Intelligence, Shanghai 200240, Peoples R China
Keywords
Speech synthesis; speaker adaptation; timbre normalization; vector quantization; pitch
DOI
10.1109/TASLP.2023.3308374
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods adapt only to the timbre of the target speakers and fail to capture their speaking styles from so little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ features, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. This decomposition enables us to closely mimic both characteristics of the target speaker when adapting with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. This embedding table later serves as a selectable codebook, or as a set of basis vectors, for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better multi-speaker text-to-speech performance than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
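
As a rough illustration of the feature pipeline the abstract describes (PCA dimensionality reduction, flow-based timbre normalization, then k-means quantization), the following Python sketch uses scikit-learn for the PCA and k-means steps. The normalizing-flow timbre normalizer is stubbed with an identity placeholder, and all variable names, feature dimensions, and cluster counts are illustrative assumptions, not values from the paper.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def timbre_normalize(feats):
    # Placeholder for the normalizing-flow model that the paper uses to
    # map speaker-dependent features into a timbre-normalized space.
    # A real implementation would apply a trained flow here (assumption).
    return feats

# Dummy stand-in for frame-level self-supervised VQ acoustic features,
# shape (frames, dims); both sizes are illustrative.
feats = np.random.randn(1000, 256).astype(np.float32)

reduced = PCA(n_components=64).fit_transform(feats)  # step 1: PCA reduction
normalized = timbre_normalize(reduced)               # step 2: timbre normalization
kmeans = KMeans(n_clusters=512, n_init=10)           # step 3: k-means re-quantization
tn_vq_tokens = kmeans.fit_predict(normalized)        # one discrete TN-VQ id per frame
print(tn_vq_tokens[:10])

In the full system, per the abstract, these discrete timbre-normalized tokens are what the acoustic model predicts from text (carrying speaking style), while timbre is reintroduced separately by the vocoder.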
Pages: 3446-3456
Page count: 11