Conformer: Convolution-augmented Transformer for Speech Recognition

Cited by: 1823
Authors
Gulati, Anmol [1]
Qin, James [1]
Chiu, Chung-Cheng [1]
Parmar, Niki [1]
Zhang, Yu [1]
Yu, Jiahui [1]
Han, Wei [1]
Wang, Shibo [1]
Zhang, Zhengdong [1]
Wu, Yonghui [1]
Pang, Ruoming [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; attention; convolutional neural networks; transformer; end-to-end; neural networks
DOI
10.21437/Interspeech.2020-3015
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way. To this end, we propose the convolution-augmented Transformer for speech recognition, named Conformer. Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy. On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test/test-other. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
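For readers who want a concrete picture of the architecture the abstract summarizes, below is a minimal sketch of a single Conformer block in PyTorch. It follows the block layout described in the full paper (a pair of half-step "macaron" feed-forward modules sandwiching multi-head self-attention and a convolution module), but the hyperparameters here are illustrative, and the paper's relative positional self-attention is replaced with standard nn.MultiheadAttention for brevity. Treat it as an assumption-laden sketch, not the authors' implementation.

# Sketch of one Conformer block (assumptions: illustrative sizes,
# standard absolute attention instead of the paper's relative
# positional self-attention).
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4):
        super().__init__()
        # First "macaron" feed-forward module (applied with a half-step residual).
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),  # Swish activation
            nn.Linear(ff_mult * d_model, d_model),
        )
        # Multi-head self-attention captures content-based global interactions.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module captures local features:
        # pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
        )
        # Second macaron feed-forward module, then a final LayerNorm.
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)                          # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        c = self.conv(self.conv_norm(x).transpose(1, 2))   # local features (B, D, T)
        x = x + c.transpose(1, 2)
        x = x + 0.5 * self.ff2(x)                          # second half-step
        return self.final_norm(x)

# Usage: a batch of 8 utterances, 100 frames each, 256-dim features.
block = ConformerBlock()
out = block(torch.randn(8, 100, 256))
print(out.shape)  # torch.Size([8, 100, 256])

In the full model, blocks like this are stacked over a convolutional subsampling front end. The depthwise convolution branch and the self-attention branch each contribute a residual, which is the "best of both worlds" combination of local and global modeling the abstract refers to.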
Pages: 5036-5040
Page count: 5