Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
The transformer, a state-of-the-art neural network architecture, has been used successfully for many sequence-to-sequence transformation tasks. This architecture disperses the attention distribution over the entire input to learn long-term dependencies, which is important for sequence-to-sequence tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), the RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range determined by the alignment information. We further introduce a regularizer to enforce this alignment. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
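The abstract's core mechanism is a Gaussian mask on the cross attention weights that keeps each output token's attention local to its assumed alignment position in the encoded speech. The snippet below is a minimal sketch of that idea, assuming the Gaussian bias is added to the attention logits and that a per-token alignment center is available; the function name gaussian_masked_cross_attention, the sigma parameter, and the toy linear alignment are illustrative assumptions, not the authors' exact formulation (the paper's alignment regularizer is not shown).

```python
# Sketch of Gaussian-masked cross attention (illustrative, not the paper's exact method).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_masked_cross_attention(queries, keys, values, centers, sigma=5.0):
    """
    queries: (T_out, d) decoder-side representations
    keys, values: (T_in, d) encoder (speech) representations
    centers: (T_out,) assumed alignment position in the input for each output step
    sigma: width of the Gaussian window, in encoder frames (assumed hyperparameter)
    """
    d = queries.shape[-1]
    # Standard scaled dot-product attention logits: (T_out, T_in)
    logits = queries @ keys.T / np.sqrt(d)

    # Gaussian bias: penalize encoder positions far from the assumed alignment center.
    positions = np.arange(keys.shape[0])                  # (T_in,)
    dist2 = (positions[None, :] - centers[:, None]) ** 2  # (T_out, T_in)
    logits = logits - dist2 / (2.0 * sigma ** 2)          # additive mask in the log domain

    weights = softmax(logits, axis=-1)                    # attention biased toward a local window
    return weights @ values, weights

# Toy usage: 8 output tokens attending over 40 speech frames with a roughly linear alignment.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((40, 16))
v = rng.standard_normal((40, 16))
centers = np.linspace(0, 39, 8)
ctx, attn = gaussian_masked_cross_attention(q, k, v, centers)
print(ctx.shape, attn.shape)  # (8, 16) (8, 40); each attention row sums to 1
```

Because the bias is applied additively to the logits rather than by hard truncation, distant frames are strongly down-weighted but not entirely excluded, which keeps the attention differentiable everywhere.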
Pages: 5031-5035
Page count: 5
Related papers
50 entries in total
  • [1] LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
    Fu, Pengbin
    Liu, Daxing
    Yang, Huirong
    INFORMATION, 2022, 13 (05)
  • [2] TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION WITH LOCAL DENSE SYNTHESIZER ATTENTION
    Xu, Menglong
    Li, Shengqiang
    Zhang, Xiao-Lei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5899 - 5903
  • [3] Monotonic Gaussian regularization of attention for robust automatic speech recognition
    Du, Yeqian
    Wu, Minghui
    Fang, Xin
    Yang, Zhouwang
    COMPUTER SPEECH AND LANGUAGE, 2023, 77
  • [4] Deformable Cross-Attention Transformer for Medical Image Registration
    Chen, Junyu
    Liu, Yihao
    He, Yufan
    Du, Yong
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I, 2024, 14348 : 115 - 125
  • [5] EXPLICIT ALIGNMENT OF TEXT AND SPEECH ENCODINGS FOR ATTENTION-BASED END-TO-END SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 913 - 919
  • [6] RAT: RNN-Attention Transformer for Speech Enhancement
    Zhang, Tailong
    He, Shulin
    Li, Hao
    Zhang, Xueliang
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 463 - 467
  • [7] Multi-task Learning with Auxiliary Cross-attention Transformer for Low-Resource Multi-dialect Speech Recognition
    Dan, Zhengjia
    Zhao, Yue
    Bi, Xiaojun
    Wu, Licheng
    Ji, Qiang
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT I, 2022, 13551 : 107 - 118
  • [8] Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer
    Wang, Xiyu
    Guo, Pengxin
    Zhang, Yu
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT V, 2023, 14173 : 309 - 325
  • [9] Learning Cross-Attention Point Transformer With Global Porous Sampling
    Duan, Yueqi
    Sun, Haowen
    Yan, Juncheng
    Lu, Jiwen
    Zhou, Jie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6283 - 6297
  • [10] Weak-Attention Suppression For Transformer Based Speech Recognition
    Shi, Yangyang
    Wang, Yongqiang
    Wu, Chunyang
    Fuegen, Christian
    Zhang, Frank
    Le, Duc
    Yeh, Ching-Feng
    Seltzer, Michael L.
    INTERSPEECH 2020, 2020, : 4996 - 5000