ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Cited by: 117
Authors
Han, Wei [1 ]
Zhang, Zhengdong [1 ]
Zhang, Yu [1 ]
Yu, Jiahui [1 ]
Chiu, Chung-Cheng [1 ]
Qin, James [1 ]
Gulati, Anmol [1 ]
Pang, Ruoming [1 ]
Wu, Yonghui [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; convolutional neural networks;
DOI
10.21437/Interspeech.2020-2059
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer-based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into the convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the width of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the best previously published results of 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
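The squeeze-and-excitation mechanism mentioned in the abstract can be sketched as follows: the time dimension of a convolutional feature map is averaged into a single global context vector, which is passed through a small bottleneck network to produce per-channel sigmoid gates that rescale the original features. This is a minimal NumPy sketch of that gating idea, not the paper's actual implementation; the function name, weight shapes, and reduction ratio are illustrative assumptions.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Apply squeeze-and-excitation gating to a (T, C) feature map.

    x  : (T, C) array, T time steps of C-channel convolutional features
    w1 : (C // r, C) bottleneck weights (r = reduction ratio)
    w2 : (C, C // r) expansion weights
    """
    # Squeeze: global average pooling over the time axis -> (C,)
    context = x.mean(axis=0)
    # Excite: bottleneck with ReLU, then expand back to C channels
    hidden = np.maximum(w1 @ context, 0.0)
    # Sigmoid produces a per-channel gate in (0, 1)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Rescale every time step's channels by the globally derived gate
    return x * gate

# Illustrative usage: 50 frames, 8 channels, reduction ratio 4
rng = np.random.default_rng(0)
T, C, r = 50, 8, 4
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = squeeze_excite(x, w1, w2)
```

Because the gate is computed from an average over all time steps, each frame's output depends on the entire utterance, which is how the module injects global context into an otherwise local convolutional encoder.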
Pages: 3610-3614
Page count: 5