Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Cited by: 164

Authors:

Hori, Takaaki [1]

Watanabe, Shinji [1]

Zhang, Yu [2]

Chan, William [3]

Affiliations:

[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA

[2] MIT, Cambridge, MA 02139 USA

[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017

Keywords:

end-to-end speech recognition; encoder-decoder; connectionist temporal classification; attention model

DOI:

10.21437/Interspeech.2017-1296

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
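The abstract says that beam search combines three sources of evidence but does not give the formula. One common way to do this is a weighted sum of log-probabilities, sketched below. The function names, the weights `ctc_weight` and `lm_weight`, and the toy scores are all illustrative assumptions, not values taken from the paper.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.1):
    """Log-linear combination of three scores for one hypothesis.

    ctc_weight and lm_weight are hypothetical tuning weights; the
    abstract does not state which values (or which formula) were used.
    """
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)


def prune(hypotheses, beam_width):
    """Keep the beam_width hypotheses with the highest joint score.

    Each hypothesis is (text, (log_p_ctc, log_p_att, log_p_lm)).
    """
    ranked = sorted(hypotheses, key=lambda h: joint_score(*h[1]), reverse=True)
    return ranked[:beam_width]


# Toy log-probabilities, made up for illustration only.
hyps = [
    ("A", (-1.0, -1.0, -1.0)),
    ("B", (-0.5, -3.0, -1.0)),
    ("C", (-2.0, -0.5, -4.0)),
]
best = prune(hyps, beam_width=2)
```

In this sketch a hypothesis that the CTC branch likes but the attention decoder strongly rejects (such as "B") is pruned, which is the kind of mutual correction a joint score is meant to provide.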


Pages: 949-953

Page count: 5
