ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

Citations: 6
Authors
Pan, Jing [1]
Shapiro, Joshua [1]
Wohlwend, Jeremy [1]
Han, Kyu J. [1]
Lei, Tao [1]
Ma, Tao [1]
Affiliations
[1] ASAPP Inc, New York, NY 10007 USA
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; state-of-the-art; LibriSpeech; multistream CNN; self-attentive SRU; NEURAL-NETWORKS;
DOI
10.21437/Interspeech.2020-2947
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has a unique dilation rate for diversity. Trained with the SpecAugment data augmentation method, it achieves relative word error rate (WER) improvements of 4% on test-clean and 14% on test-other. We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
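The multistream idea in the abstract (the same input frames passed through parallel pipelines, each with its own dilation rate) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the single feature channel, the shared hand-picked kernel, the dilation rates (1, 2, 3), and the function names `dilated_conv1d` and `multistream` are all assumptions made for illustration.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D cross-correlation over time with a given dilation.

    x:       (T,) array, one feature channel over T frames (simplification)
    kernel:  (K,) filter taps, K assumed odd
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2          # keeps the output length equal to T
    xp = np.pad(x, pad)                    # zero-pad both ends
    out = np.zeros(len(x))
    for i, w in enumerate(kernel):
        # Tap i reads the input shifted by i * dilation frames.
        out += w * xp[i * dilation : i * dilation + len(x)]
    return out


def multistream(x, kernel, dilations=(1, 2, 3)):
    """Run the same input through parallel streams, one dilation rate each.

    The dilation rates here are illustrative, not taken from the paper.
    Returns an array of shape (num_streams, T).
    """
    return np.stack([dilated_conv1d(x, kernel, d) for d in dilations], axis=0)
```

Feeding a single impulse through `multistream` shows how each stream covers a different temporal span: with a 3-tap kernel, a stream with dilation d responds at offsets -d, 0, and +d from the impulse.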
Pages: 16-20
Number of pages: 5