Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; HIDDEN MARKOV-MODELS;
DOI
10.21437/Interspeech.2020-1198
CLC Classification
R36 (Pathology); R76 (Otorhinolaryngology);
Discipline Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for different sequence-to-sequence transformation tasks. This model architecture disperses the attention distribution over the entire input to learn long-term dependencies, which is important for some sequence-to-sequence tasks, such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
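The abstract describes biasing the decoder's cross attention with a Gaussian window centered on an estimated alignment position. Below is a minimal single-head PyTorch sketch of that idea; the function name, the way the alignment center `center` is supplied, and the fixed width `sigma` are illustrative assumptions, not the paper's exact formulation (the alignment regularizer mentioned in the abstract is not shown).

```python
# Minimal sketch of Gaussian-masked cross attention (assumed PyTorch).
# `center` and `sigma` are hypothetical parameters: in practice the
# alignment center per output token would come from an alignment model
# or from the attention weights themselves, as the paper suggests.
import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(query, key, value, center, sigma=10.0):
    """Cross attention whose logits are biased by a Gaussian window.

    query:  (batch, tgt_len, d_model)  decoder states
    key:    (batch, src_len, d_model)  encoded speech representations
    value:  (batch, src_len, d_model)
    center: (batch, tgt_len)           estimated source position per token
    """
    d_k = query.size(-1)
    # Standard scaled dot-product attention logits.
    logits = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5

    # Gaussian bias: penalize source positions far from the alignment
    # center, confining each token's attention to a local speech context.
    src_pos = torch.arange(key.size(1), device=key.device).float()
    bias = -((src_pos.view(1, 1, -1) - center.unsqueeze(-1)) ** 2) / (2 * sigma ** 2)

    weights = F.softmax(logits + bias, dim=-1)
    return torch.matmul(weights, value), weights
```

Because the mask is added to the logits before the softmax rather than applied as a hard cutoff, distant frames are strongly down-weighted but never made strictly unreachable, which keeps the model differentiable end to end.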
Pages: 5031-5035
Page count: 5
Related Papers (50 in total)
  • [31] Zhu, Hongming; Wang, Zeju; Han, Letong; Xu, Manxin; Li, Weiqi; Liu, Qin; Liu, Sicong; Du, Bowen. TSMCF: Transformer-Based SAR and Multispectral Cross-Attention Fusion for Cloud Removal. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2025, 18: 6710-6720.
  • [32] Yang, Jinghui; Li, Anqi; Qian, Jinxi; Qin, Jia; Wang, Liguo. A Cross-Attention-Based Multi-Information Fusion Transformer for Hyperspectral Image Classification. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17: 13358-13375.
  • [33] Kang, S.; Kim, S.; Seo, K. Transformer-based Cross attention and Feature Diversity for Occluded Person Re-identification. Transactions of the Korean Institute of Electrical Engineers, 2023, 72(01): 108-113.
  • [34] Che, Na; Zhu, Yiming; Wang, Haiyan; Zeng, Xianwei; Du, Qinsheng. AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition. APPLIED SCIENCES-BASEL, 2025, 15(01).
  • [35] Yin, Hua; Chen, Qitong; Chen, Liang; Shen, Changqing. Cross-Attention Transformer-Based Domain Adaptation: A Novel Method for Fault Diagnosis of Rotating Machinery With High Generalizability and Alignment Capability. IEEE SENSORS JOURNAL, 2024, 24(23): 40049-40058.
  • [36] Zhou, Zhuozhi; Lan, Jinhui. A Dual Cross Attention Transformer Network for Infrared and Visible Image Fusion. 2024 7TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA, ICAIBD 2024, 2024: 494-499.
  • [37] Zhou, Yuan; Huo, Chunlei; Zhu, Jiahang; Huo, Leigang; Pan, Chunhong. DCAT: Dual Cross-Attention-Based Transformer for Change Detection. REMOTE SENSING, 2023, 15(09).
  • [38] Wang, Peng; Guo, Zhiyuan; Xie, Fei. Layer Sparse Transformer for Speech Recognition. 2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023: 269-273.
  • [39] Sun, Yajun; Wang, Meng; Ma, Ying. Semantic-alignment transformer and adversary hashing for cross-modal retrieval. APPLIED INTELLIGENCE, 2024, 54(17-18): 7581-7602.
  • [40] Li, Zirui; Liu, Runbang; Sun, Le; Zheng, Yuhui. Multi-Feature Cross Attention-Induced Transformer Network for Hyperspectral and LiDAR Data Classification. REMOTE SENSING, 2024, 16(15).