Modulation Spectrum Augmentation for Robust Speech Recognition

Cited by: 0
Authors
Yan, Bi-Cheng [1 ]
Liu, Shih-Hung [2 ]
Chen, Berlin [3 ]
Affiliations
[1] ASUS, AICS, Taipei, Taiwan
[2] Delta Elect Inc, Delta Management Syst, Taipei, Taiwan
[3] Natl Taiwan Normal Univ, Comp Sci, Taipei, Taiwan
Source
PROCEEDINGS OF THE 1ST INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION SCIENCE AND SYSTEM, AISS 2019 | 2019
Keywords
Speech recognition; Data augmentation; Robustness; Modulation spectra;
DOI
10.1145/3373477.3373695
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
Data augmentation is a crucial mechanism employed to increase the diversity of training data in order to avoid overfitting and improve the robustness of statistical models in various applications. In the context of automatic speech recognition (ASR), a recent trend has been to develop effective methods to augment training speech data by warping or masking utterances based on their waveforms or spectrograms. Extending this line of research, we explore novel ways to generate augmented training speech data and compare them with existing state-of-the-art approaches. The main contributions of this paper are at least two-fold. First, we propose to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner. This intermediate representation can be embodied in different modulation domains by performing a discrete Fourier transform (DFT) along either the time-axis or the component-axis of a cepstral feature vector sequence. Second, we also develop a two-stage augmentation approach, which successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences, to further enhance robustness. A series of experiments are carried out on the Aurora-4 database and task, in conjunction with a typical DNN-HMM based ASR system. The proposed augmentation method that conducts warping in the component-axis modulation domain of cepstral feature vector sequences can yield word error rate reductions (WERR) of 17.6% and 0.69% for the clean- and multi-condition training settings, respectively. In addition, the proposed two-stage augmentation method can at best achieve a WERR of 1.13% when using the multi-condition training setup.
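The core operation described in the abstract — taking a DFT along the time-axis or component-axis of a cepstral feature sequence, warping the resulting modulation spectrum, and transforming back — can be sketched as follows. This is a minimal illustration of the general idea only: the `alpha` parameter and the power-law magnitude warp are illustrative assumptions, not the authors' exact warping function.

```python
import numpy as np

def warp_modulation_spectrum(cepstra, alpha=1.1, axis=0):
    """Warp the magnitude modulation spectrum of a cepstral feature
    sequence of shape (frames, coefficients).

    axis=0 warps along the time-axis, axis=1 along the component-axis.
    The power-law magnitude warp (|X|**alpha) is a stand-in for the
    paper's warping function, which is not specified in the abstract.
    """
    # DFT along the chosen axis yields the modulation-domain representation
    spec = np.fft.rfft(cepstra, axis=axis)
    # Warp the magnitude while preserving the phase
    warped = (np.abs(spec) ** alpha) * np.exp(1j * np.angle(spec))
    # Inverse DFT maps the warped representation back to the cepstral domain
    return np.fft.irfft(warped, n=cepstra.shape[axis], axis=axis)

# Example: augment a sequence of 100 frames of 13 cepstral coefficients
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))
aug_time = warp_modulation_spectrum(feats, axis=0)       # time-axis warp
aug_component = warp_modulation_spectrum(feats, axis=1)  # component-axis warp
print(aug_time.shape)  # (100, 13)
```

The augmented sequences keep the original shape, so they can be fed to the same acoustic-model training pipeline as the unmodified features.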
Pages: 6