Deep Convolutional Neural Networks for Large-scale Speech Tasks

Cited by: 360
Authors
Sainath, Tara N. [1 ]
Kingsbury, Brian [1 ]
Saon, George [1 ]
Soltau, Hagen [1 ]
Mohamed, Abdel-rahman [2 ]
Dahl, George [2 ]
Ramabhadran, Bhuvana [1 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] Univ Toronto, Dept Comp Sci, Toronto, ON M5S 1A1, Canada
Keywords
Deep learning; Neural networks; Speech recognition
DOI
10.1016/j.neunet.2014.08.005
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks; specifically, we focus on how many convolutional layers are needed, how many hidden units are appropriate, and which pooling strategy works best. Second, we investigate how to incorporate speaker-adapted features, which cannot be modeled directly by CNNs because they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on three LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on all three tasks. (C) 2014 Elsevier Ltd. All rights reserved.
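To illustrate the kind of model the abstract describes, the following is a minimal PyTorch sketch of a CNN acoustic model, not the paper's actual IBM configuration: the 40-dimensional log-mel input, 11-frame context window, layer sizes, and 512 output states are all assumptions chosen for illustration. It shows the two ingredients highlighted in the abstract that can be expressed in a few lines: pooling applied along the frequency axis only, and ReLU + dropout in the fully connected layers.

import torch
import torch.nn as nn


class CNNAcousticModel(nn.Module):
    """Hypothetical CNN acoustic model: convolutional layers pool along
    frequency, fully connected layers use ReLU + dropout, and the output
    layer scores context-dependent HMM states."""

    def __init__(self, n_mels=40, context=11, n_states=512):
        super().__init__()
        # Each training example is treated as a 1 x n_mels x context "image":
        # frequency on one axis, time context on the other.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=(9, 9), padding=(4, 4)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),  # pool along frequency only
            nn.Conv2d(128, 256, kernel_size=(4, 3), padding=(2, 1)),
            nn.ReLU(),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 1, n_mels, context)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1024),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(1024, n_states),  # context-dependent HMM state scores
        )

    def forward(self, x):
        # x: (batch, 1, n_mels, context) patches of log-mel filterbank features
        return self.classifier(self.conv(x))


if __name__ == "__main__":
    model = CNNAcousticModel()
    frames = torch.randn(8, 1, 40, 11)  # a batch of 8 feature patches
    print(model(frames).shape)          # torch.Size([8, 512])

This sketch only covers frame-level cross-entropy classification; the Hessian-free sequence training and speaker-adapted feature streams discussed in the paper sit on top of such a network and are not shown here.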
Pages: 39-48
Page count: 10