Improving End-to-End Speech-to-Text Translation With Document-Level Context

Cited by: 0
Authors
Tian, Xinyu [1 ]
Wei, Haoran [2 ]
Gong, Zhengxian [1 ]
Li, Junhui [1 ]
Xie, Jun [2 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Alibaba Grp, Hangzhou 310030, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33
Funding
National Natural Science Foundation of China
Keywords
Translation; Context modeling; Training; Speech to text; Encoding; Decoding; Multitasking; Computer architecture; Logic gates; Data models; Context-aware; document-level context; end-to-end; speech-to-text translation;
DOI
10.1109/TASLPRO.2025.3570951
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In recent years, end-to-end speech-to-text translation (E2E-ST) has emerged as a promising approach. Existing ST models mostly learn from sentence-level speech, neglecting the valuable contextual information carried by document-level speech. To leverage document-level context, this paper proposes Context-Aware Speech-to-text Translation (CAST), a context-aware ST model that uses document-level context to enhance the encoding of the current speech sentence under a multi-task training framework. To better exploit the bimodal document-level context during training, on the one hand we adopt a mixup strategy that mixes the speech and text representations at the sentence level; on the other hand, we train the model to selectively utilize contextual information through a selective strategy. Experimental results on the MuST-C benchmark show that CAST significantly improves over the sentence-level baseline, yielding an average BLEU score of 30.4 and a COMET score of 78.7 across the eight translation directions.
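The sentence-level mixup of speech and text representations mentioned in the abstract can be sketched minimally as a convex combination of the two modality representations of the same sentence. This is an illustrative assumption based on the standard mixup formulation, not a detail taken from the paper: the function name `mixup_representations`, the Beta-distributed mixing coefficient, and the assumption that both representations have already been pooled or aligned to the same shape are all hypothetical.

```python
import numpy as np


def mixup_representations(speech_repr, text_repr, alpha=0.5, rng=None):
    """Mix a sentence's speech and text representations (a sketch).

    Draws a mixing coefficient lam ~ Beta(alpha, alpha) and returns the
    convex combination lam * speech + (1 - lam) * text. Assumes the two
    arrays share the same shape (e.g. a pooled sentence vector per modality).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # lam lies in (0, 1)
    return lam * speech_repr + (1.0 - lam) * text_repr
```

In practice the speech and text sequences of one sentence differ in length, so mixing at the sentence level presumably operates on pooled or length-aligned representations; the sketch above simply assumes that alignment has already happened.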
Pages: 2098-2109
Page count: 12