SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Cited by: 0
Authors
Ren, Yi [1]
Liu, Jinglin [1]
Tan, Xu [2]
Zhang, Chen [1]
Qin, Tao [2]
Zhao, Zhou [1]
Liu, Tie-Yan [2]
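Affiliations
[1] Zhejiang University, Hangzhou, Zhejiang, People's Republic of China
[2] Microsoft Research, Redmond, WA, USA
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; Natural Science Foundation of Zhejiang Province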
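Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract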
In this work, we develop SimulSpeech, an end-to-end simultaneous speech-to-text translation system that translates speech in a source language into text in a target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. Building SimulSpeech is more challenging than building previous cascaded systems (which combine simultaneous automatic speech recognition (ASR) with simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure its performance: 1) attention-level knowledge distillation transfers knowledge from the product of the attention matrices of the simultaneous NMT and ASR models to help train the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to ease the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay than full-sentence end-to-end speech-to-text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of both BLEU scores and translation delay.
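The two mechanisms named in the abstract, the wait-k attention constraint and the attention-level distillation target formed by multiplying the NMT and ASR attention matrices, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function names `wait_k_mask` and `attention_teacher` are hypothetical, the mask is defined over already-segmented source units, and the toy attention values are made up.

```python
def wait_k_mask(num_target, num_source, k):
    """Return mask[t][s]: True if target step t (0-based) may attend to
    source segment s under a wait-k policy (assumed standard form: step t
    sees the first min(t + k, num_source) source segments)."""
    return [
        [s < min(t + k, num_source) for s in range(num_source)]
        for t in range(num_target)
    ]


def attention_teacher(nmt_attn, asr_attn):
    """Multiply an NMT attention matrix (target tokens x source tokens)
    by an ASR attention matrix (source tokens x speech frames) to obtain
    a target-token x speech-frame matrix that could serve as a teacher
    signal for the speech-translation attention (illustrative assumption)."""
    inner = len(asr_attn)
    frames = len(asr_attn[0])
    return [
        [sum(row[j] * asr_attn[j][f] for j in range(inner)) for f in range(frames)]
        for row in nmt_attn
    ]


# Toy example: 3 target steps, 4 source segments, k = 2.
mask = wait_k_mask(3, 4, 2)

# Toy attention matrices (made-up values; each row sums to 1).
nmt_attn = [[1.0, 0.0], [0.5, 0.5]]            # 2 target x 2 source tokens
asr_attn = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # 2 source tokens x 3 frames
teacher = attention_teacher(nmt_attn, asr_attn)
```

Because each row of both inputs sums to 1, each row of the product also sums to 1, so the result can still be read as an attention distribution over speech frames. How the paper turns this product into a training loss is not stated in the abstract and is left out here.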
Pages: 3787-3796
Page count: 10
Related papers
50 in total
  • [41] End-to-End Thai Text-to-Speech with Linguistic Unit
    Wisetpaitoon, Kontawat
    Singkul, Sattaya
    Sakdejayont, Theerat
    Chalothorn, Tawunrat
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 951 - 959
  • [42] Curriculum Pre-training for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Zhou, Ming
    Yang, Zhenglu
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3728 - 3738
  • [43] Mutual-Learning Improves End-to-End Speech Translation
    Zhao, Jiawei
    Luo, Wei
    Chen, Boxing
    Gilman, Andrew
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3989 - 3994
  • [44] Improving End-to-End Speech Translation with Progressive Dual Encoding
    Zhang, Runlai
    Chen, Saihan
    Zhang, Yuhao
    Du, Yangfan
    Chen, Hao
    Xiao, Tong
    Zhu, Jingbo
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 199 - 212
  • [45] TIGHT INTEGRATED END-TO-END TRAINING FOR CASCADED SPEECH TRANSLATION
    Bahar, Parnia
    Bieschke, Tobias
    Schlueter, Ralf
    Ney, Hermann
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 950 - 957
  • [46] Towards a Deep Understanding of Multilingual End-to-End Speech Translation
    Sun, Haoran
    Zhao, Xiaohu
    Lei, Yikun
    Zhu, Shaolin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14332 - 14348
  • [47] Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement
    Du, Yichao
    Zhang, Zhirui
    Wang, Weizhi
    Chen, Boxing
    Xie, Jun
    Xu, Tong
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10590 - 10598
  • [48] Knowledge Distillation on Joint Task End-to-End Speech Translation
    Nayem, Khandokar Md
    Xue, Ran
    Chang, Ching-Yun
    Shanbhogue, Akshaya Vishnu Kudlu
    INTERSPEECH 2023, 2023, : 1493 - 1497
  • [49] SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
    Tsiamas, Ioannis
    Gallego, Gerard I.
    Fonollosa, Jose A. R.
    Costa-jussa, Marta R.
    INTERSPEECH 2022, 2022, : 106 - 110
  • [50] PromptST: Abstract Prompt Learning for End-to-End Speech Translation
    Yu, Tengfei
    Ding, Liang
    Liu, Xuebo
    Chen, Kehai
    Zhang, Meishan
    Tao, Dacheng
    Zhang, Min
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 10140 - 10154