Efficient Gated Convolutional Recurrent Neural Networks for Real-Time Speech Enhancement

Cited by: 5
Authors
Fazal-E-Wahab [1 ]
Ye, Zhongfu [1 ]
Saleem, Nasir [2 ]
Ali, Hamza [3 ]
Ali, Imad [4 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230026, Anhui, Peoples R China
[2] Gomal Univ, Fac Engn & Technol, Dept Elect Engn, Dera Ismail Khan, Kpk, Pakistan
[3] Univ Engn & Technol, Dept Elect Engn, Mardan, Kpk, Pakistan
[4] Univ Swat, Dept Comp Sci, Swat, Kpk, Pakistan
Keywords
Convolutional Recurrent Networks; Deep Learning; GRU; Intelligibility; LSTM; Speech Enhancement;
DOI
10.9781/ijimai.2023.05.007
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Deep learning (DL) networks have become powerful alternatives for speech enhancement and have achieved excellent results in improving speech quality, intelligibility, and background noise suppression. Due to their high computational load, however, most DL models for speech enhancement are difficult to deploy for real-time processing, and formulating resource-efficient, compact networks remains challenging. To address this problem, we propose a resource-efficient convolutional recurrent network that learns the complex ratio mask for real-time speech enhancement. A convolutional encoder-decoder and gated recurrent units (GRUs) are integrated into the convolutional recurrent network architecture, yielding a causal system suitable for real-time speech processing. Parallel GRU grouping and efficient skip connections are employed to keep the network compact. In the proposed network, the causal encoder-decoder comprises five convolutional (Conv2D) and five deconvolutional (Deconv2D) layers. A leaky rectified linear unit (LeakyReLU) is applied to all layers except the output layer, where a softplus activation confines the network output to positive values. Furthermore, batch normalization is applied after every convolution (or deconvolution) and before the activation. The network can be trained and tested with different noise types and speakers. Experiments on the LibriSpeech dataset show that the proposed real-time approach improves objective perceptual quality and intelligibility with far fewer trainable parameters than existing LSTM and GRU models. The proposed model obtained average STOI and PESQ scores of 83.53% and 2.52, respectively; quality and intelligibility improved by 31.61% and 17.18%, respectively, over noisy speech.
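Two of the abstract's key ideas can be illustrated concretely. First, complex ratio masking: the network predicts a real and an imaginary mask component, and enhancement is an element-wise complex multiplication with the noisy STFT. Second, parallel GRU grouping: replacing one large GRU with several small, independent GRUs cuts the parameter count roughly by the number of groups. The sketch below is illustrative only (function names and sizes are our own, not from the paper), assuming standard GRU gate arithmetic:

```python
import numpy as np

def apply_crm(noisy_stft, mask_real, mask_imag):
    """Apply a complex ratio mask (cRM) to a noisy STFT.

    Enhancement is element-wise complex multiplication:
    S_hat = M * Y, where M = M_r + j * M_i.
    """
    return (mask_real + 1j * mask_imag) * noisy_stft

def gru_params(input_size, hidden_size):
    # A standard GRU has 3 gates; each gate has input weights,
    # recurrent weights, and input + recurrent bias vectors.
    return 3 * (hidden_size * input_size
                + hidden_size * hidden_size
                + 2 * hidden_size)

def grouped_gru_params(input_size, hidden_size, groups):
    # Parallel GRU grouping: split the features across `groups`
    # small, independent GRUs instead of one large GRU.
    return groups * gru_params(input_size // groups,
                               hidden_size // groups)
```

As a sanity check, the ideal cRM (clean STFT divided element-wise by the noisy STFT) recovers the clean spectrogram exactly when applied with `apply_crm`, and two-way grouping roughly halves the recurrent parameter budget.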
Pages: 66-74