CrossSpeech++: Cross-Lingual Speech Synthesis With Decoupled Language and Speaker Generation

Cited by: 0
Authors
Kim, Ji-Hoon [1]
Yang, Hong-Sun [2]
Ju, Yoon-Cheol [2]
Kim, Il-Hwan [2]
Kim, Byeong-Yeol [2]
Chung, Joon Son [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon 34141, South Korea
[2] 42dot Inc, Seoul 06620, South Korea
Source
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33
Keywords
Training; Generators; Speech synthesis; Speech processing; Linguistics; Pipelines; Feeds; Acoustics; Multilingual; Decoding; Cross-lingual speech synthesis; prosody modelling; speaker generalization; speech synthesis; TEXT-TO-SPEECH;
DOI
10.1109/TASLPRO.2025.3547231
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representations. We conduct extensive experiments using various metrics and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
Pages: 1364-1374
Page count: 11