METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4

Authors:

Zhu, Xinfa [1]

Lei, Yi [1]

Li, Tao [1]

Zhang, Yongmao [1]

Zhou, Hongbin [2]

Lu, Heng [2]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLPNPU, Xian 710072, Peoples R China

[2] Ximalaya Inc, Shanghai 201203, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Cross-lingual; disentanglement; emotion transfer; speech synthesis; RECOGNITION; PROSODY;

DOI:

10.1109/TASLP.2024.3363444

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206; 082403;

Abstract:

Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech because cross-speaker, cross-lingual emotion transfer is challenging: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of the METTS design.


Pages: 1506-1518

Page count: 13

