SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems

Cited: 6
Authors
Yoon, Hyungchan [1 ]
Kim, Changhwan [1 ]
Um, Seyun [1 ]
Yoon, Hyun-Wook [2 ]
Kang, Hong-Goo [1 ]
Affiliations
[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul 03722, South Korea
[2] Naver Corp, Clova Voice, Seongnam 13561, South Korea
Keywords
Generalization; text-to-speech; zero-shot; multi-speaker; style transfer
DOI
10.1109/LSP.2023.3277786
Chinese Library Classification
TM (Electrical Engineering); TN (Electronics & Communication Technology)
Subject Classification Codes
0808; 0809
Abstract
This letter proposes an effective speaker-conditioning method applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias of the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It exploits the aforementioned inductive bias and effectively models the characteristics of speech by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2- and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measurements.
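The core mechanism described in the abstract — predicting 1-D convolution kernels from a speaker embedding and applying them to the phoneme sequence — can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation; all dimensions, the linear kernel-prediction layer `W`, and the function name `sc_cnn` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper)
spk_dim, chan, ksize, seq_len = 64, 8, 3, 20

# Hypothetical kernel-prediction layer: maps a speaker embedding to a
# bank of 1-D convolution kernels, shaped (out_ch, in_ch, kernel_size).
W = rng.standard_normal((chan * chan * ksize, spk_dim)) * 0.01

def sc_cnn(phonemes, spk_emb):
    """Apply a 1-D convolution whose kernels are predicted from the
    speaker embedding (a sketch of the SC-CNN conditioning idea)."""
    kernels = (W @ spk_emb).reshape(chan, chan, ksize)
    pad = ksize // 2
    x = np.pad(phonemes, ((0, 0), (pad, pad)))  # keep output length equal
    out = np.empty((chan, seq_len))
    for t in range(seq_len):
        # Each output frame mixes the speaker-specific local context window
        out[:, t] = np.einsum('oik,ik->o', kernels, x[:, t:t + ksize])
    return out

phonemes = rng.standard_normal((chan, seq_len))  # toy phoneme features
spk_emb = rng.standard_normal(spk_dim)           # toy speaker embedding
y = sc_cnn(phonemes, spk_emb)
print(y.shape)  # (8, 20)
```

Because the kernels themselves are a function of the speaker embedding, different speakers induce different local filters over the phoneme sequence, which is how the method injects speaker-specific local context in the phonetic domain.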
Pages: 593-597
Page count: 5