METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4
Authors
Zhu, Xinfa [1]
Lei, Yi [1]
Li, Tao [1]
Zhang, Yongmao [1]
Zhou, Hongbin [2]
Lu, Heng [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLPNPU, Xian 710072, Peoples R China
[2] Ximalaya Inc, Shanghai 201203, Peoples R China
Keywords
Cross-lingual; disentanglement; emotion transfer; speech synthesis; recognition; prosody
DOI
10.1109/TASLP.2024.3363444
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is challenging: speaker timbre, emotion, and language factors are heavily entangled in the speech signal, which causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces three designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal to better disentangle speaker timbre from the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of the METTS design.
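The abstract does not describe how the vector-quantization-based emotion matcher works internally, so the following is only a generic illustration of the underlying idea, not the authors' method: embeddings are mapped to their nearest codebook entry, and a reference is picked from those that share the query's code. All names here (`nearest_code`, `pick_reference`, the toy codebook, and the reference labels) are hypothetical, and the two-dimensional embeddings are made up for the example.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

def pick_reference(query, references, codebook):
    """Pick a reference whose quantized code matches the query's code.

    Among matching references, the one closest to the query wins.
    If no reference shares the code, fall back to the closest reference overall.
    """
    q_code = nearest_code(query, codebook)
    same = [name for name, emb in references.items()
            if nearest_code(emb, codebook) == q_code]
    pool = same or list(references)
    return min(pool, key=lambda name: math.dist(query, references[name]))

# Toy data (assumed for illustration only): two codes and three reference embeddings.
codebook = [(0.0, 0.0), (1.0, 1.0)]
references = {
    "ref_calm": (0.1, 0.0),
    "ref_happy": (0.9, 1.1),
    "ref_excited": (1.2, 0.8),
}
query = (0.6, 0.6)

print(nearest_code(query, codebook))
print(pick_reference(query, references, codebook))
```

In a real system the embeddings would come from a trained emotion encoder and the codebook would be learned; this sketch only shows the lookup step.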
Pages: 1506-1518
Number of pages: 13