U-NET: A Supervised Approach for Monaural Source Separation

Cited by: 1
Authors
Basir, Samiul [1]
Hossain, Md. Nahid [1]
Hosen, Md. Shakhawat [1]
Ali, Md. Sadek [2,3]
Riaz, Zainab [3]
Islam, Md. Shohidul [1,3]
Affiliations
[1] Islamic University, Department of Computer Science & Engineering, Kushtia 7003, Bangladesh
[2] Islamic University, Department of Information & Communication Technology, Kushtia 7003, Bangladesh
[3] Hong Kong Centre for Cerebro-Cardiovascular Health Engineering (COCHE), Hong Kong, People's Republic of China
Keywords
Speech separation; Source separation; Short-time Fourier transform (STFT); U-NET; Music separation; Neural networks; Speech
DOI
10.1007/s13369-024-08785-1
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Science]
Subject Classification Codes
07; 0710; 09
Abstract
Speech separation remains a challenging research problem, particularly when the desired source must be recovered from a single-channel mixture. Deep learning has emerged as a promising solution that surpasses traditional methods. Whereas prior work has focused mainly on the magnitude, the log-magnitude, or a combination of magnitude and phase, this paper proposes an approach based on the Short-time Fourier Transform (STFT) and a deep convolutional neural network, U-NET, that considers both the real and imaginary components of the spectrogram. During training, the mixed time-domain signal is transformed into the frequency domain with the STFT, producing a mixed complex spectrogram. The real and imaginary parts of this spectrogram are separated and concatenated into a single matrix, which is fed through U-NET to estimate the source components. The same processing is applied at test time: the concatenated matrix of the mixed test signal is passed through the trained model to produce one enhanced concatenated matrix per source. Each enhanced matrix is converted back into a time-domain signal by recovering its magnitude and phase and applying the inverse STFT. The proposed approach is evaluated on the GRID audio-visual corpus, and objective measurement metrics show improved quality and intelligibility compared with existing methods.
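The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the sampling rate, frame length, 50% overlap, the concatenation along the frequency axis, and the unet_separate placeholder (standing in for the trained U-NET) are assumptions for demonstration rather than settings taken from the paper, and SciPy's stft/istft are used in place of whatever framework the authors employed.

```python
# Sketch of the real/imaginary concatenation pipeline (assumed parameters).
import numpy as np
from scipy.signal import stft, istft

FS = 8000          # assumed sampling rate
NPERSEG = 512      # assumed STFT frame length
NOVERLAP = 256     # assumed 50% overlap

def mixture_to_input(mix):
    """Mixed time-domain signal -> concatenated real/imaginary matrix."""
    _, _, Z = stft(mix, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    # Split the complex spectrogram into real and imaginary parts and
    # stack them into one matrix along the frequency axis.
    return np.concatenate([Z.real, Z.imag], axis=0)

def output_to_waveform(concat):
    """Enhanced concatenated matrix -> time-domain source estimate."""
    half = concat.shape[0] // 2
    Z_hat = concat[:half] + 1j * concat[half:]
    # Magnitude and phase are implicit in the recombined complex
    # spectrogram; the inverse STFT maps it back to a waveform.
    _, x_hat = istft(Z_hat, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    return x_hat

def unet_separate(concat_input):
    """Placeholder for the trained U-NET: it would map the mixed matrix
    to one enhanced matrix per source. Here it just returns two copies."""
    return concat_input.copy(), concat_input.copy()

if __name__ == "__main__":
    mix = np.random.randn(FS * 2)              # 2 s of dummy mixture audio
    X = mixture_to_input(mix)
    S1, S2 = unet_separate(X)                  # stand-in for model inference
    s1_hat = output_to_waveform(S1)
    s2_hat = output_to_waveform(S2)
    print(X.shape, s1_hat.shape, s2_hat.shape)
```

In an actual implementation the placeholder would be replaced by a trained U-NET whose output provides one concatenated real/imaginary matrix per target source, as the abstract describes.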
Pages: 12679-12691
Number of pages: 13
Cited references: 34