Two-stage deep learning approach for speech enhancement and reconstruction in the frequency and time domains

Cited by: 4
Authors
Nossier, Soha A. [1 ]
Wall, Julie [1 ]
Moniri, Mansour [1 ]
Glackin, Cornelius [2 ]
Cannings, Nigel [2 ]
Affiliations
[1] Univ East London, Dept Engn & Comp, London, England
[2] Intelligent Voice Ltd, London, England
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Funding
European Union Horizon 2020;
Keywords
Deep learning; denoising autoencoders; speech enhancement; speech features; speech reconstruction; NEURAL-NETWORK; NOISE;
DOI
10.1109/IJCNN55064.2022.9892355
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep learning has recently shown promising improvement in the speech enhancement field, due to its effectiveness in eliminating noise. However, a drawback of the denoising process is the introduction of speech distortion, which negatively affects speech quality and intelligibility. In this work, we propose a deep convolutional denoising autoencoder-based speech enhancement network designed with an encoder deeper than the decoder, to improve performance and decrease complexity. Furthermore, we present a two-stage learning approach in which denoising is performed in a first, frequency-domain stage using the magnitude spectrum as the training target, while in the second stage further denoising and speech reconstruction are performed in the time domain. Results show that our architecture achieves a 0.22 improvement in the overall predicted mean opinion score (Covl) over state-of-the-art speech enhancement architectures on the Valentini benchmark dataset. Moreover, when trained on a larger dataset and tested on a mismatched test corpus, the architecture achieves improvements of 0.7 in Perceptual Evaluation of Speech Quality (PESQ) and 6.35% in Short-Time Objective Intelligibility (STOI) scores, respectively, compared to the noisy speech.
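The two-stage pipeline the abstract describes can be sketched end to end. The NumPy sketch below is illustrative only and does not reproduce the paper's networks: the trained convolutional denoising autoencoders are replaced by simple placeholder operations (a binary time-frequency mask for the frequency-domain stage, a moving-average smoother for the time-domain stage), and all function names, window sizes, and signal parameters are assumptions chosen for the toy example.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform (analysis)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, n_fft=256, hop=128):
    """Overlap-add inverse STFT; Hann at 50% overlap sums to ~1."""
    out = np.zeros((spec.shape[0] - 1) * hop + n_fft)
    for i, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=-1)):
        out[i * hop:i * hop + n_fft] += frame
    return out

def stage1_frequency_denoise(noisy):
    # Stage 1 (frequency domain): enhance the magnitude spectrum and
    # reuse the noisy phase. The paper trains a convolutional denoising
    # autoencoder on the magnitude spectrum; here a crude binary
    # time-frequency mask stands in for that network.
    spec = stft(noisy)
    mag = np.abs(spec)
    mask = mag >= np.median(mag)  # keep only the stronger bins
    return istft(spec * mask)

def stage2_time_refine(x):
    # Stage 2 (time domain): further denoising and waveform
    # reconstruction on raw samples. A moving-average smoother is a
    # placeholder for the second-stage time-domain network.
    kernel = np.ones(5) / 5.0
    return np.convolve(x, kernel, mode="same")

# Toy data: a 440 Hz tone in white noise, 16 kHz sampling rate.
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)

enhanced = stage2_time_refine(stage1_frequency_denoise(noisy))
```

The design point the abstract makes carries over even to this toy version: the first stage operates on the magnitude spectrum while keeping the noisy phase, and the second stage works directly on the reconstructed waveform, where phase-related distortion from stage one can still be reduced.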
Pages: 10
Related References
54 references in total
[31]  
Rethage D, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5069, DOI 10.1109/ICASSP.2018.8462417
[32]   Deep learning [J].
Rusk, Nicole .
NATURE METHODS, 2016, 13 (01) :35-35
[33]  
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
[34]  
Soni MH, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5039, DOI 10.1109/ICASSP.2018.8462068
[35]  
Strake M, 2019, IEEE WORK APPL SIG, P239, DOI [10.1109/WASPAA.2019.8937222, 10.1109/waspaa.2019.8937222]
[36]   An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech [J].
Taal, Cees H. ;
Hendriks, Richard C. ;
Heusdens, Richard ;
Jensen, Jesper .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (07) :2125-2136
[37]  
Topcoder, 2017, 176 SPOK LANG
[38]  
Valentini-Botinhao C, 2017, Tech. Rep.
[39]  
Valentini-Botinhao C., 2016, REV SPEECH DATABASE
[40]   ASSESSMENT FOR AUTOMATIC SPEECH RECOGNITION .2. NOISEX-92 - A DATABASE AND AN EXPERIMENT TO STUDY THE EFFECT OF ADDITIVE NOISE ON SPEECH RECOGNITION SYSTEMS [J].
VARGA, A ;
STEENEKEN, HJM .
SPEECH COMMUNICATION, 1993, 12 (03) :247-251