Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Cited: 1
Authors
Chen, Peng [1 ]
Nguyen, Binh Thien [2 ]
Geng, Yuting [2 ]
Iwai, Kenta [2 ]
Nishiura, Takanobu [2 ]
Affiliations
[1] Ritsumeikan Univ, Grad Sch Informat Sci & Engn, Osaka 5678570, Japan
[2] Ritsumeikan Univ, Coll Informat Sci & Engn, Osaka, Ibaraki 5678570, Japan
Source
IEEE ACCESS | 2024, Vol. 12
Funding
Japan Society for the Promotion of Science (JSPS)
Keywords
Training; Signal to noise ratio; Hidden Markov models; Speech recognition; Speech enhancement; Time-frequency analysis; Distortion measurement; Interference; Fitting; Artificial neural networks; Single-channel speech separation; time-frequency mask; deep neural network; joint network; ideal binary mask; ideal ratio mask; Wiener filter; spectral magnitude mask; SPEAKER RECOGNITION; ENHANCEMENT; NOISE; BINARY;
DOI
10.1109/ACCESS.2024.3479292
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Single-channel speech separation can be adopted in many applications, and time-frequency (T-F) masking is an effective method for it. With advances in deep learning, T-F masks have come to be used as training targets, achieving notable separation results. Among the many masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies with the speech mixture and the separation model. Existing approaches mainly use a single network to approximate the mask of the target speech. However, a speech mixture contains segments where speech overlaps with other speech, segments where speech coincides with silent intervals, and segments of high signal-to-noise ratio (SNR) caused by pauses and by variations in the speakers' intonation and emphasis. In this paper, we use different networks to handle these different kinds of segments. In addition to the existing network, we introduce a network (using the Rectified Linear Unit, ReLU, as its activation function) that specifically addresses segments containing a mixture of speech and silence, as well as segments with high-SNR speech mixtures. We conducted evaluation experiments on the separation of two speakers using the four aforementioned masks as training targets. The performance improvements observed in these experiments demonstrate the effectiveness of the proposed joint-network method over the conventional single-network method.
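The four training-target masks named in the abstract can be computed directly from the magnitude spectrograms of the target and interfering speech. The following is a minimal illustrative sketch, not code from the paper; it assumes aligned magnitude spectrograms and, for the SMM, approximates the mixture magnitude |Y| by |S| + |N|, since the true mixture magnitude also depends on the sources' phases.

```python
import numpy as np

def tf_masks(target_mag, interf_mag, eps=1e-8):
    """Compute the IBM, IRM, WF, and SMM training targets from the
    magnitude spectrograms of the target and interfering speech
    (arrays of the same shape, frequency x time)."""
    s2 = target_mag ** 2
    n2 = interf_mag ** 2
    ibm = (s2 > n2).astype(float)        # ideal binary mask: 1 where target dominates
    irm = np.sqrt(s2 / (s2 + n2 + eps))  # ideal ratio mask (exponent beta = 0.5)
    wf = s2 / (s2 + n2 + eps)            # Wiener-filter mask
    # SMM: |S| / |Y|, with |Y| approximated by |S| + |N| (an assumption)
    smm = target_mag / (target_mag + interf_mag + eps)
    return ibm, irm, wf, smm
```

Applied point-wise to the mixture spectrogram, each mask attenuates T-F bins dominated by the interfering speaker; a separation network is then trained to predict the mask from the mixture alone.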
Pages: 152036-152044
Number of pages: 9