Low Latency End-to-End Streaming Speech Recognition with a Scout Network

Cited by: 27
Authors
Wang, Chengyi [1 ]
Wu, Yu [2 ]
Lu, Liang [3 ]
Liu, Shujie [2 ]
Li, Jinyu [3 ]
Ye, Guoli [3 ]
Zhou, Ming [2 ]
Affiliations
[1] Nankai Univ, Tianjin, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Speech & Language Grp, Redmond, WA USA
Source
INTERSPEECH 2020 | 2020
Keywords
online speech recognition; adaptive look-ahead; streaming model; attention
DOI
10.21437/Interspeech.2020-1292
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Codes
100104; 100213
Abstract
The attention-based Transformer model has achieved promising results for offline speech recognition (SR). In the streaming mode, however, the Transformer usually incurs significant latency, because it relies on a fixed-length look-ahead window in each encoder layer to maintain recognition accuracy. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects whole-word boundaries without seeing any future frames, while the recognition network predicts the next subword using the information from all frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with only 639 ms average latency on the test-clean and test-other sets of LibriSpeech.
Pages: 2112-2116 (5 pages)
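
The abstract describes a two-stage streaming loop: a causal scout flags a word boundary without any look-ahead, and only then does the recognizer consume all frames up to that boundary. Below is a minimal conceptual sketch of that loop in PyTorch. The names ScoutNetwork and stream_decode and the recognizer callable are hypothetical illustrations, not the authors' code, and the unidirectional LSTM stands in for the paper's Transformer-based scout purely to keep the example short.

```python
# Conceptual sketch only: a causal boundary detector plus a decode loop
# that releases past frames to a recognizer when a boundary is detected.
import torch
import torch.nn as nn


class ScoutNetwork(nn.Module):
    """Scores each frame for 'word boundary here' using past context only."""

    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        # A unidirectional LSTM never sees future frames, matching the
        # scout's zero-look-ahead constraint from the abstract.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) -> boundary probability per frame
        h, _ = self.rnn(frames)
        return torch.sigmoid(self.head(h)).squeeze(-1)


def stream_decode(scout, recognizer, frames, threshold=0.5):
    """Feed frames in one at a time (batch size 1 assumed); whenever the
    scout fires, let the recognizer attend to everything up to the
    predicted boundary -- past frames only, never future ones."""
    hypotheses = []
    for t in range(1, frames.size(1) + 1):
        prob = scout(frames[:, :t])[:, -1]   # score for the newest frame
        if prob.item() > threshold:          # predicted word boundary
            hypotheses.append(recognizer(frames[:, :t]))
    return hypotheses


# Usage with a dummy recognizer (real one would emit subwords):
scout = ScoutNetwork(feat_dim=80)
utterance = torch.randn(1, 200, 80)          # one utterance, 200 frames
hyps = stream_decode(scout, lambda x: f"decoded up to frame {x.size(1)}", utterance)
```

Re-running the scout from scratch at every step is quadratic in utterance length and is done here only for clarity; a real streaming system would cache the recurrent (or Transformer) state and emit each word as soon as its boundary fires, which is the behavior behind the 639 ms average latency reported in the abstract.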