Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Beijing, Peoples R China
Source
2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA) | 2022
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax normalization used in the attention mechanism makes it hard to highlight the most important speech information, since every frame receives a non-zero weight. For multi-head attention in Transformer ASR, it is difficult to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme so that each self-attention distribution better fits its corresponding head. The monotonic attention applies a regularization term that prunes redundant heads in the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
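The abstract does not spell out the exact form of the learned sparsity scheme, so the sketch below is only a rough illustration: it replaces the softmax in scaled dot-product attention with sparsemax (Martins & Astudillo, 2016), the simplest sparse normalization, which can assign exactly zero weight to irrelevant frames. All function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sparsemax(scores: np.ndarray) -> np.ndarray:
    """Sparsemax: a sparse alternative to softmax that projects scores onto the
    probability simplex and can give exactly zero weight to some positions."""
    z = np.sort(scores)[::-1]               # sort scores in descending order
    cumsum = np.cumsum(z)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > cumsum            # positions kept in the support set
    k_z = k[support][-1]                    # size of the support set
    tau = (cumsum[support][-1] - 1) / k_z   # threshold subtracted from all scores
    return np.maximum(scores - tau, 0.0)

def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-query scaled dot-product attention with sparsemax weights."""
    d = q.shape[-1]
    weights = sparsemax(K @ q / np.sqrt(d))  # sparse attention distribution over keys
    return weights @ V

# Toy usage: several keys typically receive exactly zero attention weight.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 8))
print(sparsemax(K @ q / np.sqrt(4)))
```

In the adaptive scheme described by the paper, the degree of sparsity would be learned per head rather than fixed as above; this snippet only shows the basic sparse replacement for softmax.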
Pages: 173-180
Number of pages: 8