One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS

Cited by: 1

Authors:

Lee, Jaeuk [1]

Chang, Joon-Hyuk [1]

Affiliations:

[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea

Source:

INTERSPEECH 2022 | 2022

Keywords:

speaker adaptation; voice cloning; multi-speaker

DOI:

10.21437/Interspeech.2022-934

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Speaker adaptation for personalizing text-to-speech (TTS) has become increasingly important. Herein, we propose a novel adaptation method that uses a few seconds of data obtained from an unseen speaker. We first use a speaker embedding lookup table to train a multi-speaker TTS model, wherein each speaker embedding in the lookup table contains information representing a speaker's timbre. We propose an initial embedding predictor that extracts an initial embedding suitable for the adaptation of unseen speakers. We use the trained speaker embeddings to train the initial embedding predictor. Further, adversarial training is applied to improve the performance. After adversarial training, the initial embedding predictor infers the unseen speaker's initial embedding, and it is fine-tuned. As the initial embedding contains timbre information of the unseen speaker, adaptation is achieved faster and with less data than with conventional methods. We validate the performance with a mean opinion score (MOS) and demonstrate that adaptation is feasible with only 5 s of data.
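The initialize-then-fine-tune idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear predictor, the two-dimensional embeddings, the per-speaker reference features, the quadratic loss, and every name below are assumptions made for illustration. The adversarial training step is omitted entirely, and fine-tuning is assumed here to update the embedding only. The sketch shows only that an embedding predicted from trained speaker embeddings may need fewer fine-tuning steps than an uninformed starting point.

```python
# Toy sketch; all names, sizes, and the quadratic loss are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train_predictor(features, table, lr=0.1, epochs=500):
    """Fit a linear map W so that W @ features[s] approximates table[s] (SGD on squared error)."""
    dim_e = len(next(iter(table.values())))
    dim_x = len(next(iter(features.values())))
    W = [[0.0] * dim_x for _ in range(dim_e)]
    for _ in range(epochs):
        for spk, x in features.items():
            err = [p - t for p, t in zip(matvec(W, x), table[spk])]
            for i in range(dim_e):
                for j in range(dim_x):
                    W[i][j] -= lr * err[i] * x[j]
    return W

def fine_tune(init, target, lr=0.1, tol=1e-3, max_steps=1000):
    """Gradient descent on 0.5 * ||e - target||^2; returns (embedding, steps taken)."""
    e = list(init)
    for step in range(max_steps):
        grad = [ei - ti for ei, ti in zip(e, target)]
        if max(abs(g) for g in grad) < tol:
            return e, step
        e = [ei - lr * g for ei, g in zip(e, grad)]
    return e, max_steps

# Hypothetical trained lookup table (speaker id -> embedding) and reference features.
features = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0], "spk_c": [1.0, 1.0]}
table = {"spk_a": [2.0, -1.0], "spk_b": [0.5, 1.5], "spk_c": [2.5, 0.5]}

# The predictor is trained on the trained speaker embeddings.
W = train_predictor(features, table)

# Unseen speaker: the target stands in for the embedding that fine-tuning
# on the speaker's real audio would converge to.
unseen_feature = [0.5, 0.8]
unseen_target = [1.5, 0.6]

predicted_init = matvec(W, unseen_feature)
_, steps_predicted = fine_tune(predicted_init, unseen_target)
_, steps_zero = fine_tune([0.0, 0.0], unseen_target)
print("steps from predicted init:", steps_predicted)
print("steps from zero init:", steps_zero)
```

In this toy setting the predicted initialization starts closer to the target than the zero vector does, so gradient descent reaches the tolerance in fewer steps. A real system would replace the quadratic loss with the TTS training loss and the linear map with a neural predictor.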


Pages: 2978-2982

Page count: 5
