Coarse-to-fine speech separation method in the time-frequency domain

Cited by: 2
Authors
Yang, Xue [1 ]
Bao, Changchun [1 ]
Chen, Xianhong [1 ]
Affiliation
[1] Beijing Univ Technol, Speech & Audio Signal Proc Lab, Fac Informat Technol, Beijing 100124, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech enhancement; Speech separation; Coarse-to-fine speech separation; Recurrent neural network; Attention mechanism;
DOI
10.1016/j.specom.2023.103003
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to time-domain methods, speech separation methods in the time-frequency (T-F) domain operate mainly on structured T-F representations and have recently shown great potential. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, which introduces some distortion into the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps reduce the distortions introduced in the coarse phase. In addition, the specially designed networks used for the coarse and refining phases are jointly trained for better performance. Furthermore, by utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model achieves competitive performance on clean datasets with a small number of parameters. The proposed method also shows greater robustness and achieves state-of-the-art results on more realistic datasets.
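The two-phase pipeline described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the data flow only: `coarse_phase` and `refining_phase` are deterministic stand-ins (fixed soft masks and a guided residual redistribution) for the paper's trained networks, and the naive STFT/iSTFT pair is assumed rather than taken from the paper.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive STFT: window each frame with a Hann window, then FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)                # shape: (frames, bins)

def istft(X, n_fft=256, hop=128):
    """Windowed overlap-add inverse of the naive STFT above."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(X, n=n_fft, axis=-1) * win
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def coarse_phase(mix_tf, n_spk=2):
    """Stand-in for the coarse separation network: fixed soft masks
    that split the mixture roughly among the speakers (hence distorted)."""
    masks = np.full((n_spk,) + mix_tf.shape, 1.0 / n_spk)
    return masks * mix_tf                              # rough per-speaker estimates

def refining_phase(mix_tf, rough_tf):
    """Stand-in for the refining network: each rough T-F estimate guides the
    extraction of its residual T-F representation from the mixture."""
    residual = mix_tf - rough_tf.sum(axis=0)           # what the coarse phase missed
    weights = np.abs(rough_tf) / np.maximum(np.abs(rough_tf).sum(axis=0), 1e-8)
    return rough_tf + weights * residual               # distortion-reduced estimates

mix = np.random.default_rng(0).standard_normal(4096)   # toy "mixture" waveform
mix_tf = stft(mix)
rough = coarse_phase(mix_tf)                           # 1) coarse phase
refined = refining_phase(mix_tf, rough)                # 2) refining phase
waves = [istft(s) for s in refined]                    # one waveform per speaker
```

In the paper both phases are learned networks trained jointly; here the refining step only redistributes the mixture residual in proportion to the magnitude of each rough estimate, which preserves the key invariant that the refined estimates sum back to the mixture T-F representation.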
Pages: 12
Related papers (50 records)
  • [1] Initialization method for speech separation algorithms that work in the time-frequency domain
    Sarmiento, Auxiliadora
    Duran-Diaz, Ivan
    Cruces, Sergio
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2010, 127 (04): : EL121 - EL126
  • [2] Coarse-to-fine Optimization for Speech Enhancement
    Yao, Jian
    Al-Dahle, Ahmad
    INTERSPEECH 2019, 2019, : 2743 - 2747
  • [3] TFPSNET: TIME-FREQUENCY DOMAIN PATH SCANNING NETWORK FOR SPEECH SEPARATION
    Yang, Lei
    Liu, Wei
    Wang, Weiqin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6842 - 6846
  • [4] A Method of Sound Segmentation in Time-Frequency Domain Using Peaks and Valleys in Spectrogram for Speech Separation
    Lim, Sung-Kil
    Lee, Hyon-Soo
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2008, 27 (08): : 418 - 426
  • [5] A TIME-FREQUENCY BLIND SEPARATION METHOD FOR UNDERDETERMINED SPEECH MIXTURES
    Lv Yao
    Li Shuangtian
    JOURNAL OF ELECTRONICS (CHINA), 2008, (05) : 702 - 708
  • [6] Underdetermined Blind Source Separation of Anechoic Speech Mixtures in the Time-frequency Domain
    Lv Yao
    Li Shuangtian
    ICSP: 2008 9TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, VOLS 1-5, PROCEEDINGS, 2008, : 22 - 25
  • [7] A coarse-to-fine unsupervised domain adaptation method based on metric learning
    Peng, Yaxin
    Yang, Keni
    Zhao, Fangrong
    Shen, Chaomin
    Zhang, Yangchun
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (01) : 3013 - 3027
  • [8] Segmentation on time-frequency domain for speech segregation
    Lim, Sung-Kil
    Lee, Hyon-Soo
    2006 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATIONS, VOLS 1 AND 2, 2006, : 433 - +
  • [9] Watermarking of speech signals in the time-frequency domain
    Al-Khassaweneh, Mahmood
    Al-Zoubi, Hussein
    Aviyente, Selin
    2009 IEEE INTERNATIONAL CONFERENCE ON ELECTRO/INFORMATION TECHNOLOGY, 2009, : 317 - +
  • [10] Neural speech enhancement in the time-frequency domain
    Volkmer, M
    2003 IEEE XIII WORKSHOP ON NEURAL NETWORKS FOR SIGNAL PROCESSING - NNSP'03, 2003, : 617 - 626