Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Cited by: 157
Authors
Hori, Takaaki [1]
Watanabe, Shinji [1]
Zhang, Yu [2]
Chan, William [3]
Affiliations
[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA
[2] MIT, Cambridge, MA 02139 USA
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
end-to-end speech recognition; encoder-decoder; connectionist temporal classification; attention model
DOI
10.21437/Interspeech.2017-1296
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
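The score combination described in the abstract can be illustrated with a minimal sketch. It assumes a log-linear interpolation of the three model scores, which is one common way to combine them; the abstract does not give the formula. The names `joint_score` and `rank_hypotheses` and the weights `ctc_weight` and `lm_weight` (with their default values) are hypothetical and chosen only for illustration.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.1):
    """Combine per-hypothesis log-probabilities from three models.

    Assumption: CTC and attention scores are interpolated with ctc_weight,
    and the language-model score is added with lm_weight. The weights here
    are placeholders, not values taken from the paper.
    """
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)


def rank_hypotheses(hyps, ctc_weight=0.3, lm_weight=0.1):
    """Sort beam-search hypotheses from best to worst by joint score.

    Each hypothesis is a dict with keys "text", "ctc", "att" and "lm",
    where the last three hold log-probabilities.
    """
    return sorted(
        hyps,
        key=lambda h: joint_score(h["ctc"], h["att"], h["lm"],
                                  ctc_weight, lm_weight),
        reverse=True,
    )


if __name__ == "__main__":
    beam = [
        {"text": "hyp a", "ctc": -1.0, "att": -1.0, "lm": -1.0},
        {"text": "hyp b", "ctc": -5.0, "att": -0.5, "lm": -4.0},
    ]
    for h in rank_hypotheses(beam):
        print(h["text"], joint_score(h["ctc"], h["att"], h["lm"]))
```

In this toy beam, the second hypothesis has the better attention score alone, but its poor CTC and language-model scores push it below the first once the three are combined.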
Pages: 949-953
Page count: 5