FAR-FIELD SPEECH RECOGNITION USING CNN-DNN-HMM WITH CONVOLUTION IN TIME

Cited by: 0
Authors
Yoshioka, Takuya [1]
Karita, Shigeki [1, 2]
Nakatani, Tomohiro [1]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Kyoto, Japan
[2] Osaka Univ, Grad Sch Engn, Osaka, Japan
Source
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2015
Keywords
Far-field speech recognition; reverberation; convolutional neural network; deep neural network; NEURAL-NETWORKS;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Recent studies in speech recognition have shown that the performance of convolutional neural networks (CNNs) is superior to that of fully connected deep neural networks (DNNs). In this paper, we explore the use of CNNs in far-field speech recognition for dealing with reverberation, which blurs spectral energies along the time axis. Unlike most previous CNN applications to speech recognition, we consider convolution in time to examine whether it provides an improved reverberation modelling capability. Experimental results show that a CNN coupled with a fully connected DNN can model short time correlations in feature vectors with fewer parameters than a DNN and thus generalise better to unseen test environments. Combining this approach with signal-space dereverberation, which copes with long-term correlations, is shown to result in further improvement, where the gains from both approaches are almost additive. An initial investigation of the use of restricted convolution forms is also undertaken.
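To make the idea of convolution in time concrete, the following is a minimal NumPy sketch, not the authors' architecture. It slides a bank of filters along the frame axis of a feature matrix and compares the first-layer parameter count with that of a fully connected layer that sees a wider context window. The function names (`conv_in_time`, `conv_params`, `fc_params`) and all layer sizes (40 bins, 11-frame window, 5-frame kernel, 128 units) are assumptions chosen for illustration; the abstract does not state them.

```python
import numpy as np


def conv_in_time(features, kernels, bias):
    """Slide filters along the time axis only (no padding, stride 1).

    features: (num_frames, num_bins) array, e.g. log mel filterbank outputs.
    kernels:  (num_filters, kernel_frames, num_bins) array; each filter
              covers every frequency bin over kernel_frames frames.
    bias:     (num_filters,) array.
    Returns the linear activations, shape
    (num_frames - kernel_frames + 1, num_filters).
    As in most CNN toolkits, this is cross-correlation (no kernel flip).
    """
    num_frames, num_bins = features.shape
    num_filters, kernel_frames, kernel_bins = kernels.shape
    if kernel_bins != num_bins:
        raise ValueError("kernels must span all frequency bins")
    out_frames = num_frames - kernel_frames + 1
    if out_frames < 1:
        raise ValueError("input is shorter than the kernel")

    flat_kernels = kernels.reshape(num_filters, -1)
    out = np.empty((out_frames, num_filters))
    for t in range(out_frames):
        # The same weights are reused at every time shift t.
        patch = features[t:t + kernel_frames].reshape(-1)
        out[t] = flat_kernels @ patch + bias
    return out


def conv_params(num_filters, kernel_frames, num_bins):
    """Weights plus biases of one convolution-in-time layer."""
    return num_filters * kernel_frames * num_bins + num_filters


def fc_params(window_frames, num_bins, num_hidden):
    """Weights plus biases of one fully connected layer over a window."""
    return window_frames * num_bins * num_hidden + num_hidden


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((11, 40))        # hypothetical 11-frame window
    kernels = rng.standard_normal((128, 5, 40))  # hypothetical 5-frame filters
    acts = conv_in_time(feats, kernels, np.zeros(128))
    print(acts.shape)
    print(fc_params(11, 40, 128), conv_params(128, 5, 40))
```

With these illustrative sizes the fully connected layer needs 56448 parameters and the convolutional layer 25728, because the short kernel is shared across all time shifts within the window. This covers only the first layer and says nothing about the relative sizes of the complete models evaluated in the paper.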
Pages: 4360-4364
Number of pages: 5
Related papers
29 items in total
  • [1] Abdel-Hamid O, 2013, INTERSPEECH, P3365
  • [2] Convolutional Neural Networks for Speech Recognition
    Abdel-Hamid, Ossama
    Mohamed, Abdel-Rahman
    Jiang, Hui
    Deng, Li
    Penn, Gerald
    Yu, Dong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (10) : 1533 - 1545
  • [3] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
    Dahl, George E.
    Yu, Dong
    Deng, Li
    Acero, Alex
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01) : 30 - 42
  • [4] Delcroix M., 2014, Proceedings of REVERB Challenge Workshop
  • [5] Gales M. J. F., 2011, 2011 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA 2011), P121, DOI 10.1109/HSCMA.2011.5942377
  • [6] Habets E. A. P., 2006, THESIS
  • [7] Deep Neural Networks for Acoustic Modeling in Speech Recognition
    Hinton, Geoffrey
    Deng, Li
    Yu, Dong
    Dahl, George E.
    Mohamed, Abdel-rahman
    Jaitly, Navdeep
    Senior, Andrew
    Vanhoucke, Vincent
    Patrick Nguyen
    Sainath, Tara N.
    Kingsbury, Brian
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2012, 29 (06) : 82 - 97
  • [8] Kinoshita T, 2013, 2013 9TH INTERNATIONAL WORKSHOP ON ELECTROMAGNETIC COMPATIBILITY OF INTEGRATED CIRCUITS (EMC COMPO 2013), P1, DOI 10.1109/EMCCompo.2013.6735162
  • [9] Knill KM, 2013, 2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), P138, DOI 10.1109/ASRU.2013.6707719
  • [10] Model-Based Feature Enhancement for Reverberant Speech Recognition
    Krueger, Alexander
    Haeb-Umbach, Reinhold
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2010, 18 (07) : 1692 - 1707