Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

Cited by: 2
Authors
Peng, Yukun [1]
Ling, Zhenhua [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei, Peoples R China
Source
INTERSPEECH 2022 | 2022
Funding
National Key Research and Development Program of China
Keywords
text-to-speech; speech synthesis; multilingual; meta-learning
DOI
10.21437/Interspeech.2022-831
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder with a parameter generator conditioned on language embeddings and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method uses a two-stream model structure that contains two encoders and two decoders for pronunciation and prosody modeling, respectively, on the grounds that pronunciation knowledge and prosody knowledge should be shared among languages in different ways. In our experiments, the proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method.
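
The idea of an encoder "with a parameter generator conditioned on language embeddings", split into two streams, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that the generator is a plain linear map from a language embedding to the weights of a single linear encoder layer, it omits the decoders entirely, and the names `make_generator`, `linear`, and `TwoStreamEncoder` are hypothetical. The abstract does not specify how the two streams differ, so here they simply own separate generators.

```python
import random


def make_generator(lang_dim, out_rows, out_cols, seed=0):
    """Hypothetical parameter generator (an assumption, not the paper's design):
    a linear map from a language embedding to an (out_rows x out_cols) matrix."""
    rng = random.Random(seed)
    n_params = out_rows * out_cols
    # One row of projection weights per generated encoder parameter.
    proj = [[rng.uniform(-0.1, 0.1) for _ in range(lang_dim)]
            for _ in range(n_params)]

    def generate(lang_emb):
        # Each encoder weight is a dot product of the language embedding
        # with the corresponding projection row.
        flat = [sum(p * e for p, e in zip(row, lang_emb)) for row in proj]
        # Reshape the flat parameter vector into a weight matrix.
        return [flat[r * out_cols:(r + 1) * out_cols] for r in range(out_rows)]

    return generate


def linear(weights, x):
    """Apply a generated weight matrix to one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class TwoStreamEncoder:
    """Toy two-stream encoder: the pronunciation and prosody streams each have
    their own generator, so their weights depend on the language separately."""

    def __init__(self, lang_dim, in_dim, hid_dim):
        self.pron_gen = make_generator(lang_dim, hid_dim, in_dim, seed=1)
        self.pros_gen = make_generator(lang_dim, hid_dim, in_dim, seed=2)

    def encode(self, lang_emb, token_vecs):
        # Generate language-specific weights for each stream, then encode
        # every token vector with both streams.
        w_pron = self.pron_gen(lang_emb)
        w_pros = self.pros_gen(lang_emb)
        pron = [linear(w_pron, x) for x in token_vecs]
        pros = [linear(w_pros, x) for x in token_vecs]
        return pron, pros
```

Calling `TwoStreamEncoder(lang_dim=3, in_dim=4, hid_dim=2).encode(lang_emb, token_vecs)` with a list of 4-dimensional token vectors returns two lists of 2-dimensional vectors, one per stream. Changing `lang_emb` changes the generated weights and therefore both outputs, which is the sense in which the encoder parameters are conditioned on the language.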
Pages: 4257-4261
Number of pages: 5