From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

Times cited: 1
Authors
Liu, Danni [1]
Wang, Changhan [2]
Gong, Hongyu [2]
Ma, Xutai [2, 3]
Tang, Yun [2]
Pino, Juan [2]
Affiliations
[1] Maastricht Univ, Maastricht, Netherlands
[2] Meta AI, Menlo Pk, CA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
INTERSPEECH 2022 | 2022
Keywords
speech translation; text-to-speech; low-latency
DOI
10.21437/Interspeech.2022-10568
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge of delivering S2ST in real time is the accumulated delay between the translation and speech synthesis modules. While incremental text-to-speech (iTTS) models have recently shown large quality improvements, they typically require additional future text inputs to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then present a simple yet effective duration-scaling approach for latency reduction. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.
Pages: 1771-1775
Page count: 5
Related papers
45 in total
  • [21] Enabling effective design of multimodal interfaces for speech-to-speech translation system: An empirical study of longitudinal user behaviors over time and user strategies for coping with errors
    Shin, JongHo
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth
    COMPUTER SPEECH AND LANGUAGE, 2013, 27 (02) : 554 - 571
  • [22] Simple, Lexicalized Choice of Translation Timing for Simultaneous Speech Translation
    Fujita, Tomoki
    Neubig, Graham
    Sakti, Sakriani
    Toda, Tomoki
    Nakamura, Satoshi
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 3454 - 3458
  • [23] Automatic Speech-to-Speech Translation of Educational Videos Using SeamlessM4T and Its Use for Future VR Applications
    Stefanel Gris, Lucas Rafael
    Fernandes, Diogo
    de Oliveira, Frederico Santos
    Soares, Anderson
    de Lima Soares, Telma Woerle
    Galvao, Arlindo
    2024 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES ABSTRACTS AND WORKSHOPS, VRW 2024, 2024, : 163 - 166
  • [24] Cross-Modal Decision Regularization for Simultaneous Speech Translation
    Zaidi, Mohd Abbas
    Lee, Beomseok
    Kim, Sangha
    Kim, Chanwoo
    INTERSPEECH 2022, 2022, : 116 - 120
  • [25] Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
    Liu, Danni
    Spanakis, Gerasimos
    Niehues, Jan
    INTERSPEECH 2020, 2020, : 3620 - 3624
  • [26] Robust Lecture Speech Translation for Speech Misrecognition and Its Rescoring Effect from Multiple Candidates
    Sahashi, Koya
    Goto, Norioki
    Seki, Hiroshi
    Yamamoto, Kazumasa
    Akiba, Tomoyoshi
    Nakagawa, Seiichi
    2017 4TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATICS, CONCEPTS, THEORY, AND APPLICATIONS (ICAICTA) PROCEEDINGS, 2017,
  • [27] Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
    Deng, Keqi
    Watanabe, Shinji
    Shi, Jiatong
    Arora, Siddhant
    INTERSPEECH 2022, 2022, : 1746 - 1750
  • [28] Language model adaptation in machine translation from speech
    Bulyko, Ivan
    Matsoukas, Spyros
    Schwartz, Richard
    Nguyen, Long
    Makhoul, John
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 117 - +
  • [29] Lost in Interpreting: Speech Translation from Source or Interpreter?
    Machacek, Dominik
    Zilinec, Matus
    Bojar, Ondrej
    INTERSPEECH 2021, 2021, : 2376 - 2380
  • [30] VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation
    Wang, Tianrui
    Zhou, Long
    Zhang, Ziqiang
    Wu, Yu
    Liu, Shujie
    Gaur, Yashesh
    Chen, Zhuo
    Li, Jinyu
    Wei, Furu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3709 - 3716