SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems

Cited: 6
Authors
Yoon, Hyungchan [1 ]
Kim, Changhwan [1 ]
Um, Seyun [1 ]
Yoon, Hyun-Wook [2 ]
Kang, Hong-Goo [1 ]
Affiliations
[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul 03722, South Korea
[2] Naver Corp, Clova Voice, Seongnam 13561, South Korea
Keywords
Generalization; text-to-speech; zero-shot; multi-speaker; style transfer
DOI
10.1109/LSP.2023.3277786
Chinese Library Classification
TM (Electrical Engineering); TN (Electronics & Communication Technology)
Subject Classification Codes
0808; 0809
Abstract
This letter proposes an effective speaker-conditioning method applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias of the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It exploits the aforementioned inductive bias and effectively models the characteristics of speech by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2- and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measurements.
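The core mechanism described in the abstract — predicting 1-D convolution kernels from a speaker embedding and applying them to the phoneme sequence — can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation; all dimensions, the linear kernel-prediction layer `W`, and the function name `sc_cnn` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper)
spk_dim, chan, ksize, seq_len = 64, 8, 3, 20

# Hypothetical kernel-prediction layer: maps a speaker embedding to a
# bank of 1-D convolution kernels, shaped (out_ch, in_ch, kernel_size).
W = rng.standard_normal((chan * chan * ksize, spk_dim)) * 0.01

def sc_cnn(phonemes, spk_emb):
    """Apply a 1-D convolution whose kernels are predicted from the
    speaker embedding (a sketch of the SC-CNN conditioning idea)."""
    kernels = (W @ spk_emb).reshape(chan, chan, ksize)
    pad = ksize // 2
    x = np.pad(phonemes, ((0, 0), (pad, pad)))  # keep output length equal
    out = np.empty((chan, seq_len))
    for t in range(seq_len):
        # Each output frame mixes the speaker-specific local context window
        out[:, t] = np.einsum('oik,ik->o', kernels, x[:, t:t + ksize])
    return out

phonemes = rng.standard_normal((chan, seq_len))  # toy phoneme features
spk_emb = rng.standard_normal(spk_dim)           # toy speaker embedding
y = sc_cnn(phonemes, spk_emb)
print(y.shape)  # (8, 20)
```

Because the kernels themselves are a function of the speaker embedding, different speakers induce different local filters over the phoneme sequence, which is how the method injects speaker-specific local context in the phonetic domain.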
Pages: 593-597
Page count: 5