Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

Cited by: 1
Authors
Peng, Yukun [1]
Ling, Zhenhua [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei, Peoples R China
Source
INTERSPEECH 2022 | 2022
Funding
National Key Research and Development Program of China;
Keywords
text-to-speech; speech synthesis; multilingual; meta-learning;
DOI
10.21437/Interspeech.2022-831
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder, with a parameter generator conditioned on language embeddings, and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method uses a two-stream model structure that contains two encoders and two decoders, for pronunciation and prosody modeling respectively, on the premise that pronunciation knowledge and prosody knowledge should be shared in different ways among languages. In our experiments, the proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method.
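The parameter-generator idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, dimensions, linear generator, and the choice of one independent generator per stream are all assumptions made for illustration. It only shows how a language embedding might be mapped to encoder weights, and how two separate streams (pronunciation and prosody) could each receive their own generated weights.

```python
import random

# Hypothetical toy sketch, not the paper's model: a "parameter generator"
# is a linear map from a language embedding to the weights of one encoder layer.

def make_generator(embed_dim, in_dim, out_dim, seed):
    """Return a random (in_dim * out_dim) x embed_dim projection matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(in_dim * out_dim)]

def generate_weights(generator, lang_embedding, in_dim, out_dim):
    """Project the language embedding to a flat vector, reshape to out_dim x in_dim."""
    flat = [sum(g * e for g, e in zip(row, lang_embedding)) for row in generator]
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]

def encode(weights, features):
    """Apply the generated linear layer to one input feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

EMBED_DIM, IN_DIM, OUT_DIM = 2, 3, 4  # illustrative sizes only

# Assumption: each stream owns a separate generator, so the two streams
# can share knowledge across languages in different ways.
pron_generator = make_generator(EMBED_DIM, IN_DIM, OUT_DIM, seed=0)
pros_generator = make_generator(EMBED_DIM, IN_DIM, OUT_DIM, seed=1)

def two_stream_encode(features, lang_embedding):
    """Encode the same input once per stream with language-specific weights."""
    streams = {}
    for name, gen in (("pronunciation", pron_generator),
                      ("prosody", pros_generator)):
        weights = generate_weights(gen, lang_embedding, IN_DIM, OUT_DIM)
        streams[name] = encode(weights, features)
    return streams

features = [0.5, -1.0, 2.0]   # stand-in for one text-feature vector
lang_a = [1.0, 0.0]           # hypothetical language embeddings
lang_b = [0.0, 1.0]
out_a = two_stream_encode(features, lang_a)
out_b = two_stream_encode(features, lang_b)
```

In this sketch the same input yields different encodings per language and per stream, because the layer weights are a function of the language embedding rather than fixed parameters. How the actual model combines the two streams in its decoders is not described in the abstract and is left out here.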
Pages: 4257-4261
Number of pages: 5