Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition

Cited by: 0
Authors
Dhahbi, Sami [1]
Saleem, Nasir [2]
Gunawan, Teddy Surya [3]
Bourouis, Sami [4]
Ali, Imad [5]
Trigui, Aymen [6]
Algarni, Abeer D. [7]
Affiliations
[1] King Khalid Univ, Coll Sci & Art Mahayil, Dept Comp Sci, Muhayil Aseer 62529, Saudi Arabia
[2] Gomal Univ, Dept Elect Engn, FET, Dera Ismail Khan 29050, KPK, Pakistan
[3] Int Islamic Univ Malaysia, Elect & Comp Engn Dept, Kuala Lumpur, Malaysia
[4] Taif Univ, Coll Comp & Informat Technol, Dept Informat Technol, At Taif 21944, Saudi Arabia
[5] Univ Swat, Dept Forens Sci, Swat, Pakistan
[6] King Khalid Univ, Coll Comp Sci, Dept Comp Sci, Abha, Saudi Arabia
[7] Princess Nourah Bint Abdulrahman Univ, Coll Comp & Informat Sci, Dept Informat Technol, POB 84428, Riyadh 11671, Saudi Arabia
Keywords
Real-Time Speech; Simple Recurrent Unit (SRU); Speech Enhancement; Speech Processing; Speech Quality; NEURAL-NETWORKS; FEEDFORWARD; MASKING; NOISE; LSTM
DOI
10.9781/ijimai.2024.04.003
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Traditional recurrent neural networks (RNNs) have difficulty capturing long-term temporal dependencies. Speech enhancement nevertheless calls for lightweight recurrent models that improve speech quality while remaining computationally efficient and capturing long-term temporal dependencies. This study proposes a lightweight hourglass-shaped model for speech enhancement (SE) and automatic speech recognition (ASR). Simple recurrent units (SRUs) with skip connections are implemented, and attention gates are added to the skip connections to highlight the important features and spectral regions. The model operates without relying on future information, which makes it well suited for real-time processing. The model uses combined acoustic features and estimates two training objectives. Experimental evaluations using short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and word error rate (WER) indicate better intelligibility, perceptual quality, and word recognition rates. Composite measures further confirm the performance in terms of residual noise and speech distortion. On the TIMIT database, the proposed model improves STOI by 16.21% and PESQ by 0.69 (31.1%), whereas on the LibriSpeech database it improves STOI by 16.41% and PESQ by 0.71 (32.9%) over the noisy speech. Further, the model outperforms other deep neural networks (DNNs) in seen and unseen conditions. The ASR performance is measured using the Kaldi toolkit, and the model achieves a WER of 15.13% in noisy backgrounds.
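The recurrence named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the commonly published SRU formulation (forget gate, reset gate, and a highway connection to the input) and an additive attention gate on the skip path; the record does not state which variant the paper uses, so the function names `sru_layer` and `attention_gated_skip`, the parameter shapes, and the random toy data are all hypothetical. The layer reads only the current frame and the previous state, which is the property that allows frame-by-frame (causal) processing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """Run one SRU layer causally over x of shape (T, d). Assumed variant."""
    T, d = x.shape
    # The matrix products do not depend on the recurrent state, so they
    # are computed for all frames at once; only the loop below is sequential.
    u, uf, ur = x @ W, x @ W_f, x @ W_r
    c = np.zeros(d)
    h = np.empty_like(x)
    for t in range(T):
        f = sigmoid(uf[t] + v_f * c + b_f)   # forget gate, uses c_{t-1}
        r = sigmoid(ur[t] + v_r * c + b_r)   # reset gate, uses c_{t-1}
        c = f * c + (1.0 - f) * u[t]         # light elementwise recurrence
        h[t] = r * c + (1.0 - r) * x[t]      # highway connection to the input
    return h

def attention_gated_skip(skip, gate, W_s, W_g, psi):
    """Hypothetical additive attention gate: one weight in (0, 1) per frame."""
    a = sigmoid(np.maximum(skip @ W_s + gate @ W_g, 0.0) @ psi)  # shape (T,)
    return skip * a[:, None]

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
T, d, k = 8, 4, 3
x = rng.standard_normal((T, d))
W, W_f, W_r = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
v_f, v_r, b_f, b_r = (rng.standard_normal(d) * 0.1 for _ in range(4))
h = sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r)

# Causality check: perturbing frames 5.. must leave outputs 0..4 unchanged.
x_future = x.copy()
x_future[5:] += 1.0
h_future = sru_layer(x_future, W, W_f, W_r, v_f, v_r, b_f, b_r)

W_s, W_g = rng.standard_normal((d, k)), rng.standard_normal((d, k))
psi = rng.standard_normal(k)
gated = attention_gated_skip(x, h, W_s, W_g, psi)
```

In this sketch the gate only attenuates the skip features, so frames the gate scores low contribute less to the decoder side of the hourglass.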
Pages: 74-85
Page count: 12