METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4
Authors
Zhu, Xinfa [1 ]
Lei, Yi [1 ]
Li, Tao [1 ]
Zhang, Yongmao [1 ]
Zhou, Hongbin [2 ]
Lu, Heng [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp (ASLP@NPU), Xi'an 710072, Peoples R China
[2] Ximalaya Inc, Shanghai 201203, Peoples R China
Keywords
Cross-lingual; disentanglement; emotion transfer; speech synthesis; RECOGNITION; PROSODY;
DOI
10.1109/TASLP.2024.3363444
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Previous multilingual text-to-speech (TTS) approaches have leveraged monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is challenging: the heavy entanglement of speaker timbre, emotion, and language in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces the following designs. First, to alleviate the foreign accent problem, METTS adopts multi-scale emotion modeling that disentangles speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal to better disentangle speaker timbre from the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, yielding good naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of METTS's designs.
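As a rough illustration of the third design (not the paper's actual implementation), the vector-quantization-based matching step can be sketched as a nearest-codeword lookup: each candidate reference's emotion embedding is snapped to its closest entry in a learned emotion codebook. The function name, array shapes, and toy values below are assumptions for the sketch.

```python
import numpy as np

def vq_match(ref_emotion_embs: np.ndarray, codebook: np.ndarray):
    """Map each reference emotion embedding to its nearest codebook entry.

    ref_emotion_embs: (N, D) candidate reference embeddings (hypothetical)
    codebook:         (K, D) learned emotion codebook (hypothetical)
    Returns the quantized embeddings (N, D) and the codeword indices (N,).
    """
    # Squared Euclidean distance between every embedding and every codeword,
    # computed via broadcasting: result has shape (N, K).
    dists = ((ref_emotion_embs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)          # nearest codeword per embedding
    return codebook[indices], indices

# Toy example: 2-D embeddings, 3 codewords.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
refs = np.array([[0.1, -0.1], [1.9, 0.2]])
quantized, idx = vq_match(refs, codebook)
```

In the paper's setting, such a lookup would let the system select reference signals whose quantized emotion codes match the target, which is one plausible way a VQ emotion matcher can drive reference selection.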
Pages: 1506 - 1518
Number of pages: 13
Related Papers
50 records in total
  • [1] Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation
    Ribeiro, Manuel Sam
    Roth, Julian
    Comini, Giulia
    Huybrechts, Goeric
    Gabrys, Adam
    Lorenzo-Trueba, Jaime
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6797 - 6801
  • [2] Cross-Lingual Text-to-Speech via Hierarchical Style Transfer
    Lee, Sang-Hoon
    Choi, Ha-Yeong
    Lee, Seong-Whan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 25 - 26
  • [3] DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
    Liu, Sen
    Guo, Yiwei
    Du, Chenpeng
    Chen, Xie
    Yu, Kai
    INTERSPEECH 2023, 2023, : 616 - 620
  • [4] Incorporating Cross-speaker Style Transfer for Multi-language Text-to-Speech
    Shang, Zengqiang
    Huang, Zhihua
    Zhang, Haozhe
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2021, 2021, : 1619 - 1623
  • [5] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2019, 2019, : 2105 - 2109
  • [6] Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis
    Lei, Yi
    Yang, Shan
    Zhu, Xinfa
    Xie, Lei
    Su, Dan
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1948 - 1952
  • [7] Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2021, 2021, : 1614 - 1618
  • [8] Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5 : 3376 - 3380
  • [9] Cross-Lingual and Multilingual Speech Emotion Recognition on English and French
    Neumann, Michael
    Ngoc Thang Vu
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5769 - 5773
  • [10] Multilingual, Cross-lingual, and Monolingual Speech Emotion Recognition on EmoFilm Dataset
    Atmaja, Bagus Tris
    Sasou, Akira
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1019 - 1025