One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS

Cited by: 1
Authors
Lee, Jaeuk [1]
Chang, Joon-Hyuk [1]
Affiliations
[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Source
INTERSPEECH 2022 | 2022
Keywords
speaker adaptation; voice cloning; multi-speaker
DOI
10.21437/Interspeech.2022-934
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speaker adaptation for personalizing text-to-speech (TTS) has become increasingly important. Herein, we propose a novel adaptation method that uses a few seconds of data obtained from an unseen speaker. We first train a multi-speaker TTS model with a speaker embedding lookup table, wherein each speaker embedding contains information representing that speaker's timbre. We then propose an initial embedding predictor that extracts an initial embedding suitable for adaptation to unseen speakers. The initial embedding predictor is trained on the trained speaker embeddings, and adversarial training is applied to further improve its performance. After adversarial training, the initial embedding predictor infers the unseen speaker's initial embedding, and it is fine-tuned. Because the initial embedding already contains timbre information of the unseen speaker, adaptation is achieved faster and with less data than with conventional methods. We validate the performance with a mean opinion score (MOS) and demonstrate that adaptation is feasible with only 5 s of data.
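The benefit the abstract attributes to a good initial embedding can be illustrated with a toy sketch. This is not the authors' model: the frozen linear "decoder" `W`, the dimensions, the stand-in predictor output `predicted_init`, and the assumption that fine-tuning updates only the speaker embedding are all hypothetical choices made for illustration. Adversarial training of the predictor is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, FEAT_DIM = 8, 16  # hypothetical sizes, not taken from the paper

# Toy stand-in for a frozen multi-speaker TTS model: a fixed linear map
# from a speaker embedding to an acoustic feature vector (assumption).
W = rng.normal(size=(FEAT_DIM, EMB_DIM))


def synthesize(embedding):
    return W @ embedding


def loss(embedding, target):
    return 0.5 * float(np.sum((synthesize(embedding) - target) ** 2))


def fine_tune(init_embedding, target, steps):
    """Gradient descent on the embedding only; W stays frozen."""
    # Step size 1 / ||W||_2^2 keeps the quadratic objective stable.
    lr = 1.0 / np.linalg.norm(W, 2) ** 2
    e = init_embedding.copy()
    for _ in range(steps):
        residual = synthesize(e) - target
        e -= lr * (W.T @ residual)  # gradient of 0.5 * ||W e - y||^2
    return e


# Unseen speaker: a hidden target embedding and the features it produces.
true_emb = rng.normal(size=EMB_DIM)
target = synthesize(true_emb)

# Hypothetical predictor output: assumed to land close to the target.
predicted_init = 0.9 * true_emb
# Baseline initialization that carries no speaker information.
zero_init = np.zeros(EMB_DIM)

steps = 20
loss_predicted = loss(fine_tune(predicted_init, target, steps), target)
loss_zero = loss(fine_tune(zero_init, target, steps), target)
print(f"predicted init: {loss_predicted:.6f}  zero init: {loss_zero:.6f}")
```

With the same number of fine-tuning steps, the run that starts from the informed initialization ends at a lower loss than the uninformed one, which mirrors, in a much simplified setting, the abstract's claim that adaptation becomes faster when the starting embedding already carries speaker information.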
Pages: 2978-2982
Page count: 5