From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

Cited by: 1
Authors
Liu, Danni [1 ]
Wang, Changhan [2 ]
Gong, Hongyu [2 ]
Ma, Xutai [2 ,3 ]
Tang, Yun [2 ]
Pino, Juan [2 ]
Affiliations
[1] Maastricht Univ, Maastricht, Netherlands
[2] Meta AI, Menlo Pk, CA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
INTERSPEECH 2022, 2022
Keywords
speech translation; text-to-speech; low-latency
DOI
10.21437/Interspeech.2022-10568
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Speech-to-speech translation (S2ST) converts input speech into speech in another language. A key challenge in delivering S2ST in real time is the delay accumulated between the translation and speech synthesis modules. While recent incremental text-to-speech (iTTS) models have shown large quality improvements, they typically require additional future text input to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of the synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then present a simple yet effective duration-scaling approach for latency reduction. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.
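The duration-scaling idea in the abstract can be illustrated with a minimal sketch: if the synthesized speech itself counts toward latency, shortening predicted per-phoneme durations by a factor below 1.0 brings the utterance-end time forward. The function names, the 300 ms initial-delay figure, and the latency proxy below are illustrative assumptions, not the paper's actual implementation.

```python
def scale_durations(durations_ms, factor):
    """Scale each predicted phoneme duration by `factor` (< 1.0 speeds up speech)."""
    return [d * factor for d in durations_ms]

def speech_end_time_ms(start_offset_ms, durations_ms):
    """Time at which the synthesized utterance finishes playing,
    given an initial waiting time and the per-phoneme durations."""
    return start_offset_ms + sum(durations_ms)

# Hypothetical predicted phoneme durations (ms) and a 300 ms initial delay.
durations = [80, 120, 95, 110]
baseline_end = speech_end_time_ms(300, durations)
scaled_end = speech_end_time_ms(300, scale_durations(durations, 0.9))

print(baseline_end)  # 705
print(scaled_end)    # 664.5
```

Under this toy proxy, a 0.9 scaling factor saves roughly 10% of the speech playback time; the paper's reported 0.2-0.5 s reductions would correspond to such savings accumulated over full utterances.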
Pages: 1771-1775
Page count: 5