Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer

Cited by: 0
Authors
Swiatkowski, Jakub [1]
Wang, Duo [1]
Babianski, Mikolaj [1]
Coccia, Giuseppe [1]
Tobing, Patrick Lumban [1]
Vipperla, Ravichander [1]
Klimkov, Viacheslav [1]
Pollet, Vincent [1]
Affiliations
[1] Amazon Sci, Seattle, WA 98109 USA
Source
INTERSPEECH 2023 | 2023
Keywords
speech synthesis; cross-lingual; prosody transfer; multi-lingual; end-to-end; machine dubbing
DOI
10.21437/Interspeech.2023-441
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech generation for machine dubbing adds complexity to conventional Text-To-Speech solutions as the generated output is required to match the expressiveness, emotion and speaking rate of the source content. Capturing and transferring details and variations in prosody is a challenge. We introduce phrase-level cross-lingual prosody transfer for expressive multi-lingual machine dubbing. The proposed phrase-level prosody transfer delivers a significant 6.2% MUSHRA score increase over a baseline with utterance-level global prosody transfer, thereby closing the gap between the baseline and expressive human dubbing by 23.2%, while preserving intelligibility of the synthesised speech.
Pages: 5546-5550
Number of pages: 5
Related papers
30 in total
  • [1] ON GRANULARITY OF PROSODIC REPRESENTATIONS IN EXPRESSIVE TEXT-TO-SPEECH
    Babianski, Mikolaj
    Pokora, Kamil
    Shah, Raahil
    Sienkiewicz, Rafal
    Korzekwa, Daniel
    Klimkov, Viacheslav
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022: 892-899
  • [2] Binkowski M., 2020, INT C LEARN REPR
  • [3] Brannon W., 2021, T ASS COMPUTATIONAL
  • [4] SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
    Cho, Hyunjae
    Jung, Wonbin
    Lee, Junhyeok
    Woo, Sang Hoon
    [J]. INTERSPEECH 2022, 2022: 1-5
  • [5] XTREME-S: Evaluating Cross-lingual Speech Representations
    Conneau, Alexis
    Bapna, Ankur
    Zhang, Yu
    Ma, Min
    von Platen, Patrick
    Lozhkov, Anton
    Cherry, Colin
    Jia, Ye
    Rivera, Clara
    Kale, Mihir
    Van Esch, Daan
    Axelrod, Vera
    Khanuja, Simran
    Clark, Jonathan H.
    Firat, Orhan
    Auli, Michael
    Ruder, Sebastian
    Riesa, Jason
    Johnson, Melvin
    [J]. INTERSPEECH 2022, 2022: 3248-3252
  • [6] DURATION MODELING OF NEURAL TTS FOR AUTOMATIC DUBBING
    Effendi, Johanes
    Virkar, Yogesh
    Barra-Chicote, Roberto
    Federico, Marcello
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 8037-8041
  • [7] Federico M., 2020, IWSLT 2020
  • [8] gil Lee S., 2023, INT C LEARN REPR
  • [9] Guo Y., 2023, IEEE INT C AC SPEECH, P1
  • [10] Hsu Wei-Ning, 2019, INT C LEARN REPR