Enhanced speech separation through a supervised approach using bidirectional long short-term memory in dual domains

Times cited: 0
Authors
Basir, Samiul [1 ]
Hosen, Md Shakhawat [1 ]
Hossain, Md Nahid [1 ]
Aktaruzzaman, Md [1 ]
Ali, Md Sadek [2 ,3 ]
Islam, Md Shohidul [1 ,3 ]
Affiliations
[1] Islamic Univ, Dept Comp Sci & Engn, Kushtia 7003, Bangladesh
[2] Islamic Univ, Dept Informat & Commun Technol, Kushtia 7003, Bangladesh
[3] Hong Kong Ctr Cerebro Cardiovasc Hlth Engn COCHE, Hong Kong, Peoples R China
Keywords
Bi-directional long short-term memory; Dual-tree complex wavelet transform; Short-time Fourier transform; Source separation; Speech separation; Neural networks; Optimization
DOI
10.1016/j.compeleceng.2024.109364
Chinese Library Classification (CLC)
TP3 [computing technology; computer technology]
Discipline code
0812
Abstract
Separating individual sound sources from a mono recording is a challenging yet essential task in audio signal processing and analysis. This article presents an algorithm based on bidirectional transformations for isolating speech from single-channel audio. Applying the dual-tree complex wavelet transform (DTCWT) to the time-domain signal circumvents limitations inherent in the discrete wavelet transform (DWT), namely its sensitivity to shifts and its inability to distinguish direction. The DTCWT yields a series of subband signals, each of which is passed through the short-time Fourier transform (STFT) to produce a complex spectrogram; the magnitude of this spectrogram is fed into a bidirectional long short-term memory (Bi-LSTM) network with a specified number of layers and units. By exploiting the bidirectional capabilities of its LSTM units, the network learns both the preceding and following context of the input, enabling it to identify the target speech components, with ideal soft mask components serving as the corresponding training labels. The final predicted signal is obtained by element-wise multiplication of the complex spectrogram with the mask estimated by the model. The inverse STFT is then applied with parameters matching the forward transform, followed by the inverse DTCWT on the refined source components using the same decomposition levels and wavelet filters. Experimental evaluations on the GRID audio-visual and TIMIT databases, using the SDR, SIR, SAR, SNR, PESQ, and STOI metrics, confirm the improved source-separation quality of the proposed method.
Pages: 17