Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer

Cited by: 0

Authors:

Swiatkowski, Jakub [1]

Wang, Duo [1]

Babianski, Mikolaj [1]

Coccia, Giuseppe [1]

Tobing, Patrick Lumban [1]

Vipperla, Ravichander [1]

Klimkov, Viacheslav [1]

Pollet, Vincent [1]

Affiliations:

[1] Amazon Sci, Seattle, WA 98109 USA

Source:

INTERSPEECH 2023 | 2023

Keywords:

speech synthesis; cross-lingual; prosody transfer; multi-lingual; end-to-end; machine dubbing

DOI:

10.21437/Interspeech.2023-441

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Speech generation for machine dubbing adds complexity to conventional Text-To-Speech solutions as the generated output is required to match the expressiveness, emotion and speaking rate of the source content. Capturing and transferring details and variations in prosody is a challenge. We introduce phrase-level cross-lingual prosody transfer for expressive multi-lingual machine dubbing. The proposed phrase-level prosody transfer delivers a significant 6.2% MUSHRA score increase over a baseline with utterance-level global prosody transfer, thereby closing the gap between the baseline and expressive human dubbing by 23.2%, while preserving intelligibility of the synthesised speech.

Pages: 5546-5550

Page count: 5
