ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Cited by: 117
Authors
Han, Wei [1 ]
Zhang, Zhengdong [1 ]
Zhang, Yu [1 ]
Yu, Jiahui [1 ]
Chiu, Chung-Cheng [1 ]
Qin, James [1 ]
Gulati, Anmol [1 ]
Pang, Ruoming [1 ]
Wu, Yonghui [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
INTERSPEECH 2020, 2020
Keywords
speech recognition; convolutional neural networks;
DOI
10.21437/Interspeech.2020-2059
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the width of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the best previously published model, which achieves 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
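The abstract's key architectural idea is injecting global context into convolution layers via squeeze-and-excitation (SE). The following is a minimal NumPy sketch of the SE mechanism on a 1-D (time, channels) feature map; the function name, weight shapes, and bottleneck size are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def squeeze_and_excitation_1d(x, w1, b1, w2, b2):
    """Illustrative SE block: gate each channel of x by a global context signal.

    x  : (time, channels) output of a conv layer
    w1 : (hidden, channels) bottleneck weights; w2 : (channels, hidden)
    """
    # Squeeze: global average pooling over time gives a per-channel context vector
    context = x.mean(axis=0)                            # (channels,)
    # Excitation: bottleneck MLP -> per-channel gates in (0, 1)
    hidden = np.maximum(0.0, w1 @ context + b1)         # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid
    # Rescale: broadcast the channel gates across all time steps
    return x * gates
```

Because the gates are computed from a pooled summary of the entire utterance, every time step is modulated by global context, which is what lets a purely convolutional encoder see beyond its local receptive field.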
Pages: 3610-3614
Page count: 5