Bridging the Gap Between Monaural Speech Enhancement and Recognition With Distortion-Independent Acoustic Modeling

Cited by: 37
Authors
Wang, Peidong [1]
Tan, Ke [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Funding
US National Science Foundation
Keywords
Speech enhancement; acoustic distortion; acoustics; training; speech recognition; noise measurement; speech distortion; distortion-independent acoustic modeling; deep neural network; front-end; separation; noise
DOI
10.1109/TASLP.2019.2946789
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained on noisy speech has not produced the expected improvements in ASR performance. The lack of an enhancement benefit for recognition, i.e., the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced by the enhancement process. In this article, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this article outperform the previous best system on the CHiME-2 corpus.
Pages: 39-48 (10 pages)