Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion

Cited by: 0
Authors
Lan, Chaofeng [1 ]
Guo, Rui [1 ]
Zhang, Lei [2 ]
Wang, Shunbo [1 ]
Zhang, Meng [3 ]
Affiliations
[1] Harbin Univ Sci & Technol, Sch Measurement & Commun Engn, Harbin 150080, Peoples R China
[2] Beidahuang Ind Grp Gen Hosp, Harbin 150088, Peoples R China
[3] Guangzhou Univ, Sch Elect & Commun Engn, Guangzhou 510006, Peoples R China
Keywords
Speech separation; DCNN; U-Net; Multi-feature fusion
DOI
10.1007/s11760-025-03836-y
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
With the rapid development of computer technology, acquiring audio-visual signals in complex environments is no longer difficult, and using visual information to assist speech separation shows excellent potential. However, the problem of separating speech signals in audio-visual scenes with multiple speakers whose facial information is available has not been well solved. Because a speaker's lip movements are strongly correlated with the corresponding sound signal, this paper proposes a DCNN-U-Net speech separation model for audio-visual fusion, built on a dilated (atrous) convolutional neural network (DCNN) and U-Net. The model is trained on fused lip and audio signals so that it can better focus on the target speaker's audio, achieving visually aided speech separation. Experiments on the AVspeech dataset, evaluated with the PESQ, STOI, and SDR metrics, show that the DCNN-U-Net model achieves better audio-visual speech separation performance than the AV and DCNN-LSTM models.
Pages: 11
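The abstract specifies the architecture only at a high level: a dilated-convolution network (DCNN) combined with a U-Net, with lip features fused into the audio path. The PyTorch sketch below is not the authors' implementation; it is a minimal illustration, assuming a common mask-estimation formulation on magnitude spectrograms, of how such an audio-visual DCNN-U-Net could be wired. The class names, layer sizes, dilation schedule, and the lip-embedding dimension (lip_dim) are all hypothetical.

import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stack of dilated (atrous) 2-D convolutions: grows the receptive
    field over the spectrogram without any pooling."""
    def __init__(self, channels):
        super().__init__()
        layers = []
        for d in (1, 2, 4, 8):  # illustrative dilation schedule
            layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(channels),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class AVDCNNUNet(nn.Module):
    """U-Net encoder/decoder with a DCNN bottleneck; a per-clip lip
    embedding is broadcast over time-frequency and fused by a 1x1 conv."""
    def __init__(self, lip_dim=512, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, base, 3, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, 2 * base, 3, 2, 1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * base + lip_dim, 2 * base, 1)  # audio-visual fusion
        self.dcnn = DilatedBlock(2 * base)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(2 * base, base, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(2 * base, 1, 4, 2, 1)  # takes the skip connection

    def forward(self, spec, lip):
        # spec: (B, 1, F, T) magnitude spectrogram, F and T divisible by 4
        # lip:  (B, lip_dim) visual embedding of the target speaker's lips
        e1 = self.enc1(spec)
        e2 = self.enc2(e1)
        v = lip[:, :, None, None].expand(-1, -1, e2.shape[2], e2.shape[3])
        z = self.dcnn(self.fuse(torch.cat([e2, v], dim=1)))
        d2 = self.dec2(z)
        mask = torch.sigmoid(self.dec1(torch.cat([d2, e1], dim=1)))
        return mask * spec  # estimated target-speaker magnitude

A quick shape check under these assumptions: model = AVDCNNUNet(); out = model(torch.rand(2, 1, 256, 128), torch.rand(2, 512)) returns a (2, 1, 256, 128) masked spectrogram, which would then be combined with the mixture phase and inverted (e.g. by inverse STFT) to recover the waveform.

The three reported metrics all have standard open-source implementations. A small evaluation helper using the pesq, pystoi, and mir_eval packages (an assumption; the record does not say which implementations the authors used) might look like:

from pesq import pesq                              # ITU-T P.862 (package: pesq)
from pystoi import stoi                            # package: pystoi
from mir_eval.separation import bss_eval_sources   # package: mir_eval

def evaluate(ref, est, fs=16000):
    """PESQ / STOI / SDR for one reference-estimate pair of 1-D float
    NumPy waveforms at rate fs (PESQ expects real speech content)."""
    p = pesq(fs, ref, est, "wb")   # wide-band PESQ, defined for fs=16000
    s = stoi(ref, est, fs)         # short-time objective intelligibility
    sdr, _, _, _ = bss_eval_sources(ref[None, :], est[None, :])
    return p, s, sdr[0]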