Two-stage deep learning approach for speech enhancement and reconstruction in the frequency and time domains

Cited by: 4
Authors
Nossier, Soha A. [1 ]
Wall, Julie [1 ]
Moniri, Mansour [1 ]
Glackin, Cornelius [2 ]
Cannings, Nigel [2 ]
Affiliations
[1] Univ East London, Dept Engn & Comp, London, England
[2] Intelligent Voice Ltd, London, England
Source
2022 International Joint Conference on Neural Networks (IJCNN) | 2022
Funding
EU Horizon 2020;
Keywords
Deep learning; denoising autoencoders; speech enhancement; speech features; speech reconstruction; NEURAL-NETWORK; NOISE;
DOI
10.1109/IJCNN55064.2022.9892355
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep learning has recently shown promising improvements in speech enhancement, owing to its effectiveness in eliminating noise. However, a drawback of the denoising process is the introduction of speech distortion, which negatively affects speech quality and intelligibility. In this work, we propose a deep convolutional denoising autoencoder-based speech enhancement network, designed with an encoder deeper than the decoder to improve performance and reduce complexity. Furthermore, we present a two-stage learning approach in which denoising is performed first in the frequency domain, using the magnitude spectrum as the training target, while further denoising and speech reconstruction are performed in the time domain in the second stage. Results show that our architecture achieves a 0.22 improvement in the overall predicted mean opinion score (Covl) over state-of-the-art speech enhancement architectures on the Valentini benchmark dataset. Moreover, when trained on a larger dataset and tested on a mismatched test corpus, the architecture achieves improvements of 0.7 in Perceptual Evaluation of Speech Quality (PESQ) and 6.35% in Short-Time Objective Intelligibility (STOI) compared to the noisy speech.
Pages: 10
Related Papers
54 items in total
[1] Boll, S. F. Suppression of Acoustic Noise in Speech Using Spectral Subtraction [J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27(02): 113-120.
[2] Defossez, Alexandre; Synnaeve, Gabriel; Adi, Yossi. Real Time Speech Enhancement in the Waveform Domain [J]. INTERSPEECH 2020, 2020: 3291-3295.
[3] Du, J. 2008, C INT SPEECH COMM AS.
[4] Ephraim, Y.; Van Trees, H. L. A Signal Subspace Approach for Speech Enhancement [J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(04): 251-266.
[5] Ephraim, Y.; Malah, D. Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator [J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1984, 32(06): 1109-1121.
[6] Fu, S. W. 2019, PR MACH LEARN RES, V97.
[7] Fu, Szu-Wei; Wang, Tao-Wei; Tsao, Yu; Lu, Xugang; Kawai, Hisashi. End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(09): 1570-1584.
[8] Germain, Francois G.; Chen, Qifeng; Koltun, Vladlen. Speech Denoising With Deep Feature Losses [J]. INTERSPEECH 2019, 2019: 2723-2727.
[9] Hao, X. 2020, INT CONF ACOUST SPEE, P6959. DOI: 10.1109/ICASSP40776.2020.9053188.
[10] Hu, Yi; Loizou, Philipos C. Evaluation of Objective Quality Measures for Speech Enhancement [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(01): 229-238.