Multi-target ensemble learning based speech enhancement with temporal-spectral structured target

Cited by: 2
Authors
Wang, Wenbo [1]
Guo, Weiwei [2, 3, 4]
Liu, Houguang [1]
Yang, Jianhua [1]
Liu, Songyong [1]
Affiliations
[1] China Univ Min & Technol, Sch Mechatron Engn, Xuzhou 221116, Peoples R China
[2] Chinese Peoples Liberat Army Gen Hosp, Coll Otolaryngol Head & Neck Surg, Beijing 100853, Peoples R China
[3] Natl Clin Res Ctr Otolaryngol Dis, Beijing 100853, Peoples R China
[4] Minist Educ, Key Lab Hearing Sci, Beijing 100853, Peoples R China
Keywords
Speech enhancement; Temporal-spectral structured target; Multi-target ensemble learning; Sparse nonnegative matrix factorization; RECURRENT NEURAL-NETWORKS; TRAINING TARGETS; NOISE; SEPARATION; FEATURES; QUALITY; BINARY; INTELLIGIBILITY; RECOGNITION; ALGORITHM
DOI
10.1016/j.apacoust.2023.109268
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, deep neural network (DNN)-based speech enhancement has shown considerable success, with mapping-based and masking-based methods being the two most commonly used approaches. However, these methods do not consider the spectral structure of the signal. In this paper, a novel structured multi-target ensemble learning (SMTEL) framework is proposed, which uses the temporal-spectral structures of the targets to improve speech quality and intelligibility. First, the basis matrices of clean speech, noise, and the ideal ratio mask (IRM), which contain the basic structures of the signals, are captured by sparse nonnegative matrix factorization. Second, the basis matrices are co-trained with a multi-target DNN to estimate the activation matrices instead of directly estimating the targets. Then, a jointly trained single-layer perceptron is proposed to integrate the two targets and further improve speech quality and intelligibility. The sequential floating forward selection method is used to systematically analyze the impact of the integrated targets on enhancement performance and the effect of the target weights on the results. Finally, the proposed method is combined with progressive learning to further improve enhancement performance. Systematic experiments on the UW/NU corpus show that the proposed method achieves the best enhancement at low network cost and complexity, especially in visible nonstationary noise environments. Compared with the target integration method that does not use structured targets and with the long short-term memory masking method, the proposed method improves speech quality by 25.6 % and 29.2 % in restaurant noise, and speech intelligibility by 35.5 % and 15.8 %, respectively. (c) 2023 Elsevier Ltd. All rights reserved.
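The first step described in the abstract, capturing basis matrices by sparse nonnegative matrix factorization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Euclidean cost with an L1 penalty on the activations, the multiplicative updates, the column normalization heuristic, and the names `sparse_nmf`, `ideal_ratio_mask`, `sparsity`, and `rank` are all assumptions chosen for illustration. The IRM helper uses one commonly cited definition, which may differ from the one used in the paper.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def ideal_ratio_mask(speech_mag, noise_mag):
    """One common IRM definition (assumed here): sqrt(S^2 / (S^2 + N^2))."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + EPS))


def sparse_nmf(V, rank, sparsity=0.1, n_iter=200, seed=0):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    W holds the basis vectors (spectral structures) and H the activations.
    Heuristic sketch: multiplicative updates for the Euclidean cost with an
    L1 penalty of weight `sparsity` on H, plus unit-norm basis columns.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + EPS
    H = rng.random((rank, n_frames)) + EPS
    for _ in range(n_iter):
        # Activation update; the L1 penalty enters the denominator.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + EPS)
        # Basis update for the unpenalized reconstruction term.
        W *= (V @ H.T) / (W @ (H @ H.T) + EPS)
        # Normalize basis columns and rescale H so that W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0) + EPS
        W /= norms
        H *= norms[:, None]
    return W, H
```

In the framework summarized above, a multi-target DNN would estimate the activation matrices for fixed, pre-learned bases, and a target would then be rebuilt as `W @ H`; the sketch covers only the factorization itself, which could be applied separately to clean speech, noise, and IRM matrices.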
Pages: 13
Related papers
59 in total
  • [1] UNIFIED APPROACH TO SHORT-TIME FOURIER-ANALYSIS AND SYNTHESIS
    ALLEN, JB
    RABINER, LR
    [J]. PROCEEDINGS OF THE IEEE, 1977, 65 (11) : 1558 - 1564
  • [2] Algorithms and applications for approximate nonnegative matrix factorization
    Berry, Michael W.
    Browne, Murray
    Langville, Amy N.
    Pauca, V. Paul
    Plemmons, Robert J.
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2007, 52 (01) : 155 - 173
  • [3] SUPPRESSION OF ACOUSTIC NOISE IN SPEECH USING SPECTRAL SUBTRACTION
    BOLL, SF
    [J]. IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1979, 27 (02) : 113 - 120
  • [4] Speaker separation in realistic noise environments with applications to a cognitively-controlled hearing aid
    Borgstrom, Bengt J.
    Brandstein, Michael S.
    Ciccarelli, Gregory A.
    Quatieri, Thomas F.
    Smalt, Christopher J.
    [J]. NEURAL NETWORKS, 2021, 140 : 136 - 147
  • [5] A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation
    Chen, Hangting
    Zhang, Pengyuan
    [J]. NEURAL NETWORKS, 2021, 141 : 238 - 248
  • [6] Long short-term memory for speaker generalization in supervised speech separation
    Chen, Jitong
    Wang, DeLiang
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 141 (06) : 4705 - 4714
  • [7] A Feature Study for Classification-Based Speech Separation at Low Signal-to-Noise Ratios
    Chen, Jitong
    Wang, Yuxuan
    Wang, DeLiang
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (12) : 1993 - 2002
  • [8] Multi-objective based multi-channel speech enhancement with BiLSTM network
    Cui, Xingyue
    Chen, Zhe
    Yin, Fuliang
    [J]. APPLIED ACOUSTICS, 2021, 177
  • [9] Features for Masking-Based Monaural Speech Separation in Reverberant Conditions
    Delfarah, Masood
    Wang, DeLiang
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2017, 25 (05) : 1085 - 1094
  • [10] SPEECH ENHANCEMENT USING A MINIMUM MEAN-SQUARE ERROR SHORT-TIME SPECTRAL AMPLITUDE ESTIMATOR
    EPHRAIM, Y
    MALAH, D
    [J]. IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1984, 32 (06) : 1109 - 1121