Improving End-to-End Speech-to-Text Translation With Document-Level Context

Cited by: 0
Authors
Tian, Xinyu [1 ]
Wei, Haoran [2 ]
Gong, Zhengxian [1 ]
Li, Junhui [1 ]
Xie, Jun [2 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Alibaba Grp, Hangzhou 310030, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33
Funding
National Natural Science Foundation of China
Keywords
Translation; Context modeling; Training; Speech to text; Encoding; Decoding; Multitasking; Computer architecture; Logic gates; Data models; Context-aware; document-level context; end-to-end; speech-to-text translation;
DOI
10.1109/TASLPRO.2025.3570951
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In recent years, end-to-end speech-to-text translation (E2E-ST) has emerged as a promising approach. Existing ST models mostly learn from sentence-level speech, neglecting the valuable contextual information carried by document-level speech. To leverage document-level context, this paper proposes Context-Aware Speech-to-text Translation (CAST), a context-aware ST model that uses document-level context to enhance the encoding of the current speech sentence under a multi-task training framework. To better exploit the bimodal document-level context during training, on the one hand we adopt a mixup strategy that mixes the speech and text representations at the sentence level; on the other hand, we train the model to selectively utilize contextual information through a selective strategy. Experimental results on the MuST-C benchmark show that CAST significantly improves over the sentence-level baseline, yielding an average BLEU score of 30.4 and a COMET score of 78.7 across the eight translation directions.
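The sentence-level mixup of speech and text representations mentioned in the abstract can be sketched minimally as a convex combination of the two modality representations of the same sentence. This is an illustrative assumption based on the standard mixup formulation, not a detail taken from the paper: the function name `mixup_representations`, the Beta-distributed mixing coefficient, and the assumption that both representations have already been pooled or aligned to the same shape are all hypothetical.

```python
import numpy as np


def mixup_representations(speech_repr, text_repr, alpha=0.5, rng=None):
    """Mix a sentence's speech and text representations (a sketch).

    Draws a mixing coefficient lam ~ Beta(alpha, alpha) and returns the
    convex combination lam * speech + (1 - lam) * text. Assumes the two
    arrays share the same shape (e.g. a pooled sentence vector per modality).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # lam lies in (0, 1)
    return lam * speech_repr + (1.0 - lam) * text_repr
```

In practice the speech and text sequences of one sentence differ in length, so mixing at the sentence level presumably operates on pooled or length-aligned representations; the sketch above simply assumes that alignment has already happened.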
Pages: 2098-2109
Page count: 12