Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems

Cited by: 142
Authors
Kolbaek, Morten [1]
Tan, Zheng-Hua [1]
Jensen, Jesper [1, 2]
Affiliations
[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark
[2] Oticon AS, DK-2765 Smorum, Denmark
Keywords
Deep neural networks; generalizability; ideal ratio mask; intelligibility; speech enhancement; SQUARE ERROR ESTIMATION; NOISE; ALGORITHM; RATIO; OPTIMIZATION; COEFFICIENTS; RECOGNITION; REAL
DOI
10.1109/TASLP.2016.2628641
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, we study aspects of single-microphone speech enhancement (SE) based on deep neural networks (DNNs). Specifically, we explore the generalizability of state-of-the-art DNN-based SE systems with respect to the background noise type, the gender of the target speaker, and the signal-to-noise ratio (SNR). Furthermore, we investigate how specialized DNN-based SE systems, which have been trained to be either noise type specific, speaker specific, or SNR specific, perform relative to DNN-based SE systems that have been trained to be noise type general, speaker general, and SNR general. Finally, we compare how a DNN-based SE system trained to be noise type general, speaker general, and SNR general performs relative to a state-of-the-art short-time spectral amplitude minimum mean square error (STSA-MMSE) based SE algorithm. We show that DNN-based SE systems, when trained specifically to handle certain speakers, noise types, and SNRs, are capable of achieving large improvements in estimated speech quality (SQ) and speech intelligibility (SI) when tested in matched conditions. Furthermore, we show that improvements in estimated SQ and SI can be achieved by a DNN-based SE system when exposed to unseen speakers, genders, and noise types, provided that a large number of speakers and noise types have been used in the training of the system. In addition, we show that a DNN-based SE system that has been trained using a large number of speakers and a wide range of noise types outperforms a state-of-the-art STSA-MMSE based SE method when tested using a range of unseen speakers and noise types. Finally, a listening test using several DNN-based SE systems tested in unseen speaker conditions shows that these systems can improve SI for some SNR and noise type configurations but degrade SI for others.
Pages: 153-167
Page count: 15