RELAXED ATTENTION: A SIMPLE METHOD TO BOOST PERFORMANCE OF END-TO-END AUTOMATIC SPEECH RECOGNITION

Cited by: 5
Authors
Lohrenz, Timo [1]
Schwarz, Patrick [1]
Li, Zhengyang [1]
Fingscheidt, Tim [1]
Affiliations
[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Commun Technol, Schleinitzstr. 22, D-38106 Braunschweig, Germany
Source
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021
Keywords
End-to-end speech recognition; encoder-decoder models; relaxed attention; speech transformer; transformer
DOI
10.1109/ASRU51503.2021.9688298
CLC classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention: a simple, gradual injection of a uniform distribution into the encoder-decoder attention weights during training, easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We find that transformers trained with relaxed attention consistently outperform the standard baseline models when decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter.
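The relaxation described in the abstract can be illustrated in a few lines of PyTorch. The following is a minimal sketch, not the authors' implementation: it assumes the injection is a convex combination of the softmax cross-attention weights with a uniform distribution over the encoder frames, controlled by the single hyperparameter gamma and applied only during training; the function name and signature are illustrative.

import torch

def relax_attention(attn_weights: torch.Tensor, gamma: float, training: bool) -> torch.Tensor:
    # attn_weights: softmax output of the encoder-decoder attention,
    # with the last dimension running over the encoder frames.
    if not training or gamma == 0.0:
        return attn_weights  # decoding uses the unmodified attention weights
    num_frames = attn_weights.size(-1)
    uniform = torch.full_like(attn_weights, 1.0 / num_frames)
    # Convex combination keeps each attention row a valid probability distribution.
    return (1.0 - gamma) * attn_weights + gamma * uniform

In this sketch, gamma = 0 recovers standard attention, while larger values push the training-time attention toward uniform, which is one way to read the paper's stated goal of counteracting overconfidence.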
Pages: 177-184
Number of pages: 8