Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

Cited by: 1

Authors:

Peng, Yukun [1]

Ling, Zhenhua [1]

Affiliations:

[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Funding:

National Key Research and Development Program of China;

Keywords:

text-to-speech; speech synthesis; multilingual; meta-learning;

DOI:

10.21437/Interspeech.2022-831

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder, with a parameter generator conditioned on language embeddings, and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method uses a two-stream model structure that contains two encoders and two decoders, for pronunciation and prosody modeling respectively, considering that pronunciation knowledge and prosody knowledge should be shared in different ways among languages. In our experiments, the proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method.
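The abstract's "parameter generator conditioned on language embeddings" can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the generator is a single linear map from a language embedding to the weights of one small layer, and all names (`make_generator`, `generate_weights`, `apply_layer`) and dimensions are hypothetical. The two-stream structure of the proposed method is not modeled here.

```python
import random

def make_generator(lang_dim, in_dim, out_dim, seed=0):
    """Hypothetical parameter generator: one row of coefficients per
    generated weight, mapping a language embedding to that weight."""
    rng = random.Random(seed)
    n_params = out_dim * in_dim
    return [[rng.uniform(-0.1, 0.1) for _ in range(lang_dim)]
            for _ in range(n_params)]

def generate_weights(generator, lang_embedding, in_dim, out_dim):
    """Produce an (out_dim x in_dim) weight matrix for one language."""
    flat = [sum(g * e for g, e in zip(row, lang_embedding))
            for row in generator]
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]

def apply_layer(weights, x):
    """Apply the generated linear layer to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Assumed toy sizes; real systems use far larger dimensions.
LANG_DIM, IN_DIM, OUT_DIM = 4, 3, 2
generator = make_generator(LANG_DIM, IN_DIM, OUT_DIM)

lang_a = [1.0, 0.0, 0.0, 0.0]  # hypothetical embedding for language A
lang_b = [0.0, 1.0, 0.0, 0.0]  # hypothetical embedding for language B
x = [0.5, -1.0, 2.0]           # stand-in for a text feature vector

w_a = generate_weights(generator, lang_a, IN_DIM, OUT_DIM)
w_b = generate_weights(generator, lang_b, IN_DIM, OUT_DIM)
y_a = apply_layer(w_a, x)
y_b = apply_layer(w_b, x)
```

The generator's coefficients are shared by all languages, while each language embedding selects its own layer weights, so the same input `x` is encoded differently for language A and language B.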


Pages: 4257-4261

Page count: 5
