Trainable Dynamic Subsampling for End-to-End Speech Recognition

Cited by: 6
Authors
Zhang, Shucong [1 ]
Loweimi, Erfan [1 ]
Xu, Yumo [2 ]
Bell, Peter [1 ]
Renals, Steve [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Univ Edinburgh, Inst Language Cognit & Computat, Edinburgh, Midlothian, Scotland
Source
INTERSPEECH 2019 | 2019
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
speech recognition; sequence-to-sequence; attentional encoder-decoder; recurrent neural network;
DOI
10.21437/Interspeech.2019-2778
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Jointly optimised attention-based encoder-decoder models have yielded impressive speech recognition results. The recurrent neural network (RNN) encoder is a key component in such models - it learns the hidden representations of the inputs. However, it is difficult for RNNs to model the long sequences characteristic of speech recognition. To address this, subsampling between stacked recurrent layers of the encoder is commonly employed. This method reduces the length of the input sequence and leads to gains in accuracy. However, static subsampling may both include redundant information and miss relevant information. We propose using a dynamic subsampling RNN (dsRNN) encoder. Unlike a statically subsampled RNN encoder, the dsRNN encoder can learn to skip redundant frames. Furthermore, the skip ratio may vary at different stages of training, thus allowing the encoder to learn the most relevant information for each epoch. Although the dsRNN is unidirectional, it yields lower phone error rates (PERs) than a bidirectional RNN on TIMIT. The dsRNN encoder has a 16.8% PER on the TIMIT test set, a considerable improvement over static subsampling methods used with unidirectional and bidirectional RNN encoders (23.5% and 20.4% PER respectively).
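The dynamic subsampling described in the abstract can be illustrated with a toy gated RNN step. This is a minimal NumPy sketch under stated assumptions: the sigmoid skip gate, the hard threshold, and the random weights are illustrative choices, not the paper's exact formulation; in the actual dsRNN the skipping behaviour is learned jointly with the recognition objective rather than fixed by a threshold at inference time.

```python
import numpy as np

def dynamic_subsample(frames, hidden_size=4, threshold=0.5, seed=0):
    """Toy dynamic subsampling over a (T, input_size) frame sequence.

    Each frame produces a candidate RNN update and a scalar 'skip gate';
    frames whose gate falls below `threshold` are skipped, so the hidden
    state is carried over unchanged. Weights here are random stand-ins
    for parameters that would normally be trained end-to-end.
    """
    rng = np.random.default_rng(seed)
    input_size = frames.shape[1]
    W_x = rng.standard_normal((input_size, hidden_size)) * 0.1
    W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
    w_gate = rng.standard_normal(hidden_size) * 0.1

    h = np.zeros(hidden_size)
    kept = []
    for t, x in enumerate(frames):
        candidate = np.tanh(x @ W_x + h @ W_h)            # vanilla RNN update
        gate = 1.0 / (1.0 + np.exp(-candidate @ w_gate))  # sigmoid skip gate
        if gate >= threshold:
            h = candidate      # keep this frame: commit the update
            kept.append(t)
        # else: redundant frame, hidden state carried over unchanged
    return h, kept
```

Because the gate depends on the evolving hidden state, the effective skip ratio varies with the input (and, during training, across epochs), which is the property the abstract contrasts with static subsampling. A hard threshold like this is non-differentiable, so a trained version would need a technique such as a straight-through estimator to pass gradients through the skip decision.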
Pages: 1413-1417
Page count: 5