METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4
Authors
Zhu, Xinfa [1 ]
Lei, Yi [1 ]
Li, Tao [1 ]
Zhang, Yongmao [1 ]
Zhou, Hongbin [2 ]
Lu, Heng [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp (ASLP@NPU), Xi'an 710072, Peoples R China
[2] Ximalaya Inc, Shanghai 201203, Peoples R China
Keywords
Cross-lingual; disentanglement; emotion transfer; speech synthesis; RECOGNITION; PROSODY;
DOI
10.1109/TASLP.2024.3363444
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Previous multilingual text-to-speech (TTS) approaches have leveraged monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is challenging: the heavy entanglement of speaker timbre, emotion, and language in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces the following designs. First, to alleviate the foreign accent problem, METTS adopts multi-scale emotion modeling that disentangles speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal to better disentangle speaker timbre from the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, yielding good naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of METTS's designs.
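As a rough illustration of the third design (not the paper's actual implementation), the vector-quantization-based matching step can be sketched as a nearest-codeword lookup: each candidate reference's emotion embedding is snapped to its closest entry in a learned emotion codebook. The function name, array shapes, and toy values below are assumptions for the sketch.

```python
import numpy as np

def vq_match(ref_emotion_embs: np.ndarray, codebook: np.ndarray):
    """Map each reference emotion embedding to its nearest codebook entry.

    ref_emotion_embs: (N, D) candidate reference embeddings (hypothetical)
    codebook:         (K, D) learned emotion codebook (hypothetical)
    Returns the quantized embeddings (N, D) and the codeword indices (N,).
    """
    # Squared Euclidean distance between every embedding and every codeword,
    # computed via broadcasting: result has shape (N, K).
    dists = ((ref_emotion_embs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)          # nearest codeword per embedding
    return codebook[indices], indices

# Toy example: 2-D embeddings, 3 codewords.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
refs = np.array([[0.1, -0.1], [1.9, 0.2]])
quantized, idx = vq_match(refs, codebook)
```

In the paper's setting, such a lookup would let the system select reference signals whose quantized emotion codes match the target, which is one plausible way a VQ emotion matcher can drive reference selection.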
Pages: 1506 - 1518
Number of pages: 13
Related Papers
50 records in total
  • [1] Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation
    Ribeiro, Manuel Sam
    Roth, Julian
    Comini, Giulia
    Huybrechts, Goeric
    Gabrys, Adam
    Lorenzo-Trueba, Jaime
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6797 - 6801
  • [2] Cross-Lingual Text-to-Speech via Hierarchical Style Transfer
    Lee, Sang-Hoon
    Choi, Ha-Yeong
    Lee, Seong-Whan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 25 - 26
  • [3] DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
    Liu, Sen
    Guo, Yiwei
    Du, Chenpeng
    Chen, Xie
    Yu, Kai
    INTERSPEECH 2023, 2023, : 616 - 620
  • [4] Incorporating Cross-speaker Style Transfer for Multi-language Text-to-Speech
    Shang, Zengqiang
    Huang, Zhihua
    Zhang, Haozhe
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2021, 2021, : 1619 - 1623
  • [5] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2019, 2019, : 2105 - 2109
  • [6] Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis
    Lei, Yi
    Yang, Shan
    Zhu, Xinfa
    Xie, Lei
    Su, Dan
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1948 - 1952
  • [7] Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2021, 2021, : 1614 - 1618
  • [8] Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5 : 3376 - 3380
  • [9] Cross-Lingual and Multilingual Speech Emotion Recognition on English and French
    Neumann, Michael
    Ngoc Thang Vu
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5769 - 5773
  • [10] Multilingual, Cross-lingual, and Monolingual Speech Emotion Recognition on EmoFilm Dataset
    Atmaja, Bagus Tris
    Sasou, Akira
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1019 - 1025