ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Cited by: 0
Authors
Le, Chenyang [1 ,4 ]
Qian, Yao [2 ]
Zhou, Long [3 ]
Liu, Shujie [3 ]
Qian, Yanmin [1 ]
Zeng, Michael [2 ]
Huang, Xuedong [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Microsoft Cloud & AI, Redmond, WA USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Microsoft, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
Pages: 12
Related papers
50 in total
  • [1] End-to-End Speech-to-Text Translation: A Survey
    Sethiya, Nivedita
    Maurya, Chandresh Kumar
    COMPUTER SPEECH AND LANGUAGE, 2025, 90
  • [2] Revisiting End-to-End Speech-to-Text Translation From Scratch
    Zhang, Biao
    Haddow, Barry
    Sennrich, Rico
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [3] Towards End-to-End Speech-to-Text Summarization
    Monteiro, Raul
    Pernes, Diogo
    TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 : 304 - 316
  • [4] M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
    Zhao, Jinming
    Yang, Hao
    Shareghi, Ehsan
    Haffari, Gholamreza
    INTERSPEECH 2022, 2022, : 111 - 115
  • [5] LEVERAGING WEAKLY SUPERVISED DATA TO IMPROVE END-TO-END SPEECH-TO-TEXT TRANSLATION
    Jia, Ye
    Johnson, Melvin
    Macherey, Wolfgang
    Weiss, Ron J.
    Cao, Yuan
    Chiu, Chung-Cheng
    Ari, Naveen
    Laurenzo, Stella
    Wu, Yonghui
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7180 - 7184
  • [6] TOWARDS END-TO-END SPEECH-TO-TEXT TRANSLATION WITH TWO-PASS DECODING
    Sung, Tzu-Wei
    Liu, Jun-You
    Lee, Hung-yi
    Lee, Lin-shan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7175 - 7179
  • [7] Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation
    Dong, Qianqian
    Ye, Rong
    Wang, Mingxuan
    Zhou, Hao
    Xu, Shuang
    Xu, Bo
    Li, Lei
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 12749 - 12759
  • [8] AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
    Huang, Wuwei
    Wang, Dexin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2539 - 2545
  • [9] SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction
    Chen, Junkun
    Ma, Mingbo
    Zheng, Renjie
    Huang, Liang
    INTERSPEECH 2021, 2021, : 2232 - 2236
  • [10] SimulSpeech: End-to-End Simultaneous Speech to Text Translation
    Ren, Yi
    Liu, Jinglin
    Tan, Xu
    Zhang, Chen
    Qin, Tao
    Zhao, Zhou
    Liu, Tie-Yan
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3787 - 3796