Two-Stage Enhancement of Noisy and Reverberant Microphone Array Speech for Automatic Speech Recognition Systems Trained with Only Clean Speech

Cited by: 0
Authors
Wang, Quandong [1 ,2 ,3 ]
Wang, Sicheng [3 ]
Ge, Fengpei [4 ]
Han, Chang Woo [5 ]
Lee, Jaewon [5 ]
Guo, Lianghao [2 ]
Lee, Chin-Hui [3 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Inst Acoust, State Key Lab Acoust, Beijing, Peoples R China
[3] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[4] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[5] Samsung Elect, Samsung Res, Seoul, South Korea
Source
2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2018
Keywords
speech enhancement; multiple interferences; multi-channel processing; deep learning; speech recognition; ROBUST; PERSPECTIVE; BEAMFORMER; REGRESSION; INSIGHTS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We propose a two-stage approach to enhancing far-field microphone array speech collected under reverberant conditions and corrupted by interfering speakers and noises. We aim to produce high-quality enhanced speech for a black-box automatic speech recognition (ASR) system already trained with clean speech. We explore different deep neural network (DNN) architectures, and the best configuration comprises two stages. First, in pre-enhancement, we exploit features in temporal context from a subset of microphones to perform enhancement. Second, in integration, we concatenate the enhanced and noisy features from all microphones to estimate the anechoic speech of a reference channel as the overall output. Tested on eight speakers from the Wall Street Journal corpus, each providing 5 minutes of speech for DNN training, at a signal-to-interference-plus-noise ratio of 5-15 dB, a speaker-to-array distance of 1-5 m, and a reverberation time of 0.2-0.3 s, our best 8-channel, speaker-dependent enhancement system attains a perceptual evaluation of speech quality (PESQ) score of 2.95, up from 2.43 for our single-channel system. Followed by speaker-independent ASR on a 230K-word recognition task, we achieve a word error rate of 6.56%, down from 17.89% for enhanced speech from the single-channel system and from 48.47% for unprocessed noisy speech of the reference channel.
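The data flow of the two-stage pipeline described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the feature dimension, temporal context size, and feedforward topology are assumptions, and random weights stand in for trained regression DNNs; only the stage-wise concatenation of enhanced and noisy multi-channel features is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 257    # e.g. spectral bins per frame (assumed, not from the paper)
CONTEXT = 7       # temporal context frames for pre-enhancement (assumed)
N_CHANNELS = 8    # microphone array channels (from the 8-channel system)

def make_dnn(in_dim, out_dim, hidden=1024):
    """Random-weight stand-in for a trained feedforward regression DNN."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.01
    W2 = rng.standard_normal((hidden, out_dim)) * 0.01
    return lambda x: np.maximum(x @ W1, 0.0) @ W2  # one ReLU hidden layer

# Stage 1 (pre-enhancement): map each channel's noisy features, expanded
# with temporal context, to enhanced features for the same frame.
pre_enhance = make_dnn(FEAT_DIM * CONTEXT, FEAT_DIM)
noisy = rng.standard_normal((N_CHANNELS, FEAT_DIM))             # current frame
noisy_ctx = rng.standard_normal((N_CHANNELS, FEAT_DIM * CONTEXT))
enhanced = pre_enhance(noisy_ctx)                               # (8, 257)

# Stage 2 (integration): concatenate enhanced and noisy features from all
# channels and regress to the anechoic speech of the reference channel.
integrate = make_dnn(FEAT_DIM * N_CHANNELS * 2, FEAT_DIM)
stacked = np.concatenate([enhanced.reshape(-1), noisy.reshape(-1)])
anechoic_est = integrate(stacked[None, :])                      # (1, 257)
```

The key architectural point is that the integration stage sees both the pre-enhanced and the raw noisy features of every channel, so it can fall back on the unprocessed observations where pre-enhancement is unreliable.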
Pages: 21-25
Page count: 5