Acoustic modelling from the signal domain using CNNs

Cited by: 36
Authors
Ghahremani, Pegah [1]
Manohar, Vimal [1]
Povey, Daniel [1,2]
Khudanpur, Sanjeev [1,2]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD USA
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
US National Science Foundation
Keywords
raw waveform; statistic extraction layer; Network In Network nonlinearity;
DOI
10.21437/Interspeech.2016-1495
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Most speech recognition systems use spectral features based on fixed filters, such as MFCC and PLP. In this paper, we show that it is possible to achieve state-of-the-art results by making the feature extractor a part of the network and jointly optimizing it with the rest of the network. The basic approach is to start with a convolutional layer that operates on the signal (say, with a step size of 1.25 milliseconds), aggregate the filter outputs over a portion of the time axis using a network-in-network architecture, and then down-sample to every 10 milliseconds for use by the rest of the network. We find that, unlike some previous work on learned feature extractors, the objective function converges as fast as for a network based on traditional features. Because we found that iVector adaptation is less effective in this framework, we also experiment with a different adaptation method that is part of the network, where activation statistics over a medium time span (around a second) are computed at intermediate layers. We find that the resulting 'direct-from-signal' network is competitive with our state-of-the-art networks based on conventional features with iVector adaptation.
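The pipeline the abstract describes can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's Kaldi implementation: the filter count, filter length, hidden size, and random weights standing in for learned parameters are all assumptions made for the example; only the 1.25 ms step, the 10 ms output rate, and the ~1 s statistics window come from the abstract.

```python
import numpy as np

# Illustrative assumptions: 16 kHz audio, 40 learned filters, 5 ms filter length.
SAMPLE_RATE = 16000
STEP = int(0.00125 * SAMPLE_RATE)   # 1.25 ms step -> 20 samples
WIN = 4 * STEP                      # hypothetical 5 ms filter length (80 samples)
SUBSAMPLE = 8                       # 8 x 1.25 ms = 10 ms output frames

rng = np.random.default_rng(0)

def conv_frontend(signal, n_filters=40):
    """First layer: learned FIR filters applied to raw samples every 1.25 ms."""
    filters = rng.standard_normal((n_filters, WIN)) * 0.01  # random stand-ins
    n_steps = (len(signal) - WIN) // STEP + 1
    frames = np.stack([signal[i * STEP : i * STEP + WIN] for i in range(n_steps)])
    return frames @ filters.T        # (n_steps, n_filters)

def nin_nonlinearity(x, hidden=64):
    """Network-in-network block: a small per-step MLP instead of a fixed nonlinearity."""
    w1 = rng.standard_normal((x.shape[1], hidden)) * 0.1
    w2 = rng.standard_normal((hidden, x.shape[1])) * 0.1
    return np.maximum(x @ w1, 0.0) @ w2

def statistics_extension(x, span=100):
    """Adaptation idea from the abstract: append mean/stddev activation
    statistics pooled over a medium window (~1 s = 100 frames at 10 ms)."""
    out = np.empty((x.shape[0], 3 * x.shape[1]))
    for t in range(x.shape[0]):
        ctx = x[max(0, t - span // 2) : t + span // 2]
        out[t] = np.concatenate([x[t], ctx.mean(axis=0), ctx.std(axis=0)])
    return out

signal = rng.standard_normal(SAMPLE_RATE)            # 1 s of synthetic "audio"
steps = nin_nonlinearity(conv_frontend(signal))      # 1.25 ms resolution
feats = statistics_extension(steps[::SUBSAMPLE])     # 10 ms resolution
print(feats.shape)                                   # -> (100, 120)
```

The down-sampling step is why the learned front-end can feed an otherwise conventional acoustic model: after `[::SUBSAMPLE]` the frame rate matches the usual 10 ms of MFCC/PLP features.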
Pages: 3434-3438
Number of pages: 5