Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Cited by: 1
Authors
Chen, Peng [1 ]
Nguyen, Binh Thien [2 ]
Geng, Yuting [2 ]
Iwai, Kenta [2 ]
Nishiura, Takanobu [2 ]
Affiliations
[1] Ritsumeikan Univ, Grad Sch Informat Sci & Engn, Osaka 5678570, Japan
[2] Ritsumeikan Univ, Coll Informat Sci & Engn, Osaka, Ibaraki 5678570, Japan
Source
IEEE ACCESS | 2024, Vol. 12
Funding
Japan Society for the Promotion of Science;
Keywords
Training; Signal to noise ratio; Hidden Markov models; Speech recognition; Speech enhancement; Time-frequency analysis; Distortion measurement; Interference; Fitting; Artificial neural networks; Single-channel speech separation; time-frequency mask; deep neural network; joint network; ideal binary mask; ideal ratio mask; Wiener filter; spectral magnitude mask; SPEAKER RECOGNITION; ENHANCEMENT; NOISE; BINARY;
DOI
10.1109/ACCESS.2024.3479292
CLC Classification Number
TP [Automation technology, computer technology];
Subject Classification Code
0812 ;
Abstract
Single-channel speech separation can be adopted in many applications. Time-frequency (T-F) masking is an effective method for single-channel speech separation. With advancements in deep learning, T-F masks have come to be used as training targets, achieving notable separation results. Among the numerous masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies depending on the speech mixture and separation model. Existing approaches mainly utilize a single network to approximate the mask of the target speech. However, in mixed speech, there are segments where speech is mixed with other speech, segments where speech is mixed with silent intervals, and segments where high signal-to-noise ratio (SNR) speech is mixed due to pauses and variations in the speakers' intonation and emphasis. In this paper, we attempt to use different networks to handle speech segments containing various mixtures. In addition to the existing network, we introduce a network (using the Rectified Linear Unit (ReLU) as the activation function) to specifically address segments containing a mixture of speech and silence, as well as segments with high-SNR speech mixtures. We conducted evaluation experiments on the speech separation of two speakers using the four aforementioned masks as training targets. The performance improvements observed in the evaluation experiments demonstrate the effectiveness of our proposed method based on the joint network compared to the conventional method based on the single network.
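The four training targets named in the abstract can be illustrated with a minimal sketch, assuming magnitude spectrograms of the target and interfering speech are available. The formulations below follow the definitions commonly used in the masking literature and are not necessarily the paper's exact implementation.

```python
import numpy as np

def compute_masks(S, N, eps=1e-8):
    """Common formulations of four T-F mask training targets.

    S, N: magnitude spectrograms (freq x time) of the target speech
    and the interfering speech, respectively. The SMM here uses
    |S| + |N| as a stand-in for the mixture magnitude, which is an
    approximation (true mixture magnitude requires complex STFTs).
    """
    S2, N2 = S ** 2, N ** 2
    ibm = (S2 > N2).astype(float)             # ideal binary mask: local SNR > 0 dB
    irm = np.sqrt(S2 / (S2 + N2 + eps))       # ideal ratio mask
    wf = S2 / (S2 + N2 + eps)                 # Wiener-filter mask
    smm = np.clip(S / (S + N + eps), 0.0, 1.0)  # spectral magnitude mask, clipped
    return ibm, irm, wf, smm
```

Note that under these definitions the WF mask equals the square of the IRM, which is one reason their separation performance tends to differ systematically across SNR conditions.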
Pages: 152036-152044
Page count: 9