Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems

Cited by: 145

Authors:

Kolbaek, Morten [1]

Tan, Zheng-Hua [1]

Jensen, Jesper [1, 2]

Affiliations:

[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark

[2] Oticon AS, DK-2765 Smorum, Denmark

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2017 / Vol. 25 / No. 01

Keywords:

Deep neural networks; generalizability; ideal ratio mask; intelligibility; speech enhancement; SQUARE ERROR ESTIMATION; NOISE; ALGORITHM; RATIO; OPTIMIZATION; COEFFICIENTS; RECOGNITION; REAL;

DOI:

10.1109/TASLP.2016.2628641

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In this paper, we study aspects of single microphone speech enhancement (SE) based on deep neural networks (DNNs). Specifically, we explore the generalizability of state-of-the-art DNN-based SE systems with respect to the background noise type, the gender of the target speaker, and the signal-to-noise ratio (SNR). Furthermore, we investigate how specialized DNN-based SE systems, which have been trained to be either noise type specific, speaker specific, or SNR specific, perform relative to DNN-based SE systems that have been trained to be noise type general, speaker general, and SNR general. Finally, we compare how a DNN-based SE system trained to be noise type general, speaker general, and SNR general performs relative to a state-of-the-art short-time spectral amplitude minimum mean square error (STSA-MMSE) based SE algorithm. We show that DNN-based SE systems, when trained specifically to handle certain speakers, noise types, and SNRs, are capable of achieving large improvements in estimated speech quality (SQ) and speech intelligibility (SI) when tested in matched conditions. Furthermore, we show that improvements in estimated SQ and SI can be achieved by a DNN-based SE system when exposed to unseen speakers, genders, and noise types, given that a large number of speakers and noise types have been used in the training of the system. In addition, we show that a DNN-based SE system that has been trained using a large number of speakers and a wide range of noise types outperforms a state-of-the-art STSA-MMSE based SE method when tested using a range of unseen speakers and noise types. Finally, a listening test using several DNN-based SE systems tested in unseen speaker conditions shows that these systems can improve SI for some SNR and noise type configurations but degrade SI for others.

Pages: 153-167

Page count: 15
