Two-Stage Enhancement of Noisy and Reverberant Microphone Array Speech for Automatic Speech Recognition Systems Trained with Only Clean Speech

Cited by: 0
Authors
Wang, Quandong [1 ,2 ,3 ]
Wang, Sicheng [3 ]
Ge, Fengpei [4 ]
Han, Chang Woo [5 ]
Lee, Jaewon [5 ]
Guo, Lianghao [2 ]
Lee, Chin-Hui [3 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Inst Acoust, State Key Lab Acoust, Beijing, Peoples R China
[3] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[4] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[5] Samsung Elect, Samsung Res, Seoul, South Korea
Source
2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2018
Keywords
speech enhancement; multiple interferences; multi-channel processing; deep learning; speech recognition; ROBUST; PERSPECTIVE; BEAMFORMER; REGRESSION; INSIGHTS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We propose a two-stage approach to enhancing far-field microphone array speech collected under reverberant conditions and corrupted by interfering speakers and noises. We aim to produce high-quality enhanced speech for a black-box automatic speech recognition (ASR) system already trained with clean speech. We explore different deep neural network (DNN) architectures, and the best configuration comprises two stages. First, in pre-enhancement, we exploit features in temporal context from a subset of microphones to perform enhancement. Second, in integration, we concatenate the enhanced and noisy features from all microphones to estimate the anechoic speech of a reference channel as the overall output. Tested on eight speakers from the Wall Street Journal corpus, each providing 5 minutes of speech for DNN training, at a signal-to-interference-plus-noise ratio of 5-15 dB, a speaker-to-array distance of 1-5 m, and a reverberation time of 0.2-0.3 s, our best 8-channel, speaker-dependent enhancement system attains a perceptual evaluation of speech quality (PESQ) score of 2.95, up from 2.43 for our single-channel system. Followed by speaker-independent ASR on a 230K-word recognition task, we achieve a word error rate of 6.56%, down from 17.89% for enhanced speech from the single-channel system and from 48.47% for unprocessed noisy speech of the reference channel.
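The data flow of the two-stage pipeline described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the feature dimension, temporal context size, and feedforward topology are assumptions, and random weights stand in for trained regression DNNs; only the stage-wise concatenation of enhanced and noisy multi-channel features is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 257    # e.g. spectral bins per frame (assumed, not from the paper)
CONTEXT = 7       # temporal context frames for pre-enhancement (assumed)
N_CHANNELS = 8    # microphone array channels (from the 8-channel system)

def make_dnn(in_dim, out_dim, hidden=1024):
    """Random-weight stand-in for a trained feedforward regression DNN."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.01
    W2 = rng.standard_normal((hidden, out_dim)) * 0.01
    return lambda x: np.maximum(x @ W1, 0.0) @ W2  # one ReLU hidden layer

# Stage 1 (pre-enhancement): map each channel's noisy features, expanded
# with temporal context, to enhanced features for the same frame.
pre_enhance = make_dnn(FEAT_DIM * CONTEXT, FEAT_DIM)
noisy = rng.standard_normal((N_CHANNELS, FEAT_DIM))             # current frame
noisy_ctx = rng.standard_normal((N_CHANNELS, FEAT_DIM * CONTEXT))
enhanced = pre_enhance(noisy_ctx)                               # (8, 257)

# Stage 2 (integration): concatenate enhanced and noisy features from all
# channels and regress to the anechoic speech of the reference channel.
integrate = make_dnn(FEAT_DIM * N_CHANNELS * 2, FEAT_DIM)
stacked = np.concatenate([enhanced.reshape(-1), noisy.reshape(-1)])
anechoic_est = integrate(stacked[None, :])                      # (1, 257)
```

The key architectural point is that the integration stage sees both the pre-enhanced and the raw noisy features of every channel, so it can fall back on the unprocessed observations where pre-enhancement is unreliable.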
Pages: 21-25
Page count: 5