METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4
Authors
Zhu, Xinfa [1]
Lei, Yi [1]
Li, Tao [1]
Zhang, Yongmao [1]
Zhou, Hongbin [2]
Lu, Heng [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLPNPU, Xian 710072, Peoples R China
[2] Ximalaya Inc, Shanghai 201203, Peoples R China
Keywords
Cross-lingual; disentanglement; emotion transfer; speech synthesis; RECOGNITION; PROSODY;
DOI
10.1109/TASLP.2024.3363444
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is challenging: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal to better disentangle speaker timbre from the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of the METTS design.
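The abstract does not describe how the emotion matcher works internally. As a rough illustration of the general idea of vector quantization used for reference selection, the sketch below maps an embedding to its nearest codebook entry and picks a reference that falls in the same code. This is not the METTS implementation: the function names (`nearest_code`, `pick_reference`), the two-dimensional toy embeddings, the codebook values, and the "same code index" matching rule are all assumptions made for illustration.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = math.dist(vector, code)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def pick_reference(query, references, codebook):
    """Return the name of the first reference that quantizes to the same code as `query`.

    Hypothetical matching rule: two embeddings "match" when they share a code index.
    Returns None when no reference shares the query's code.
    """
    target = nearest_code(query, codebook)
    for name, embedding in references.items():
        if nearest_code(embedding, codebook) == target:
            return name
    return None


# Toy data (assumed): a three-entry codebook and two candidate reference embeddings.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
references = {"ref_a": (0.1, -0.1), "ref_b": (0.9, 1.2)}
chosen = pick_reference((1.1, 0.8), references, codebook)
```

In a real system the codebook would be learned jointly with the model and the embeddings would come from an emotion encoder; the fixed codebook here only keeps the example self-contained.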
Pages: 1506-1518
Number of pages: 13