Research on DCNN-U-Net speech separation method based on Audio-Visual multimodal fusion

Cited by: 0
Authors
Lan, Chaofeng [1 ]
Guo, Rui [1 ]
Zhang, Lei [2 ]
Wang, Shunbo [1 ]
Zhang, Meng [3 ]
Affiliations
[1] Harbin Univ Sci & Technol, Sch Measurement & Commun Engn, Harbin 150080, Peoples R China
[2] Beidahuang Ind Grp Gen Hosp, Harbin 150088, Peoples R China
[3] Guangzhou Univ, Sch Elect & Commun Engn, Guangzhou 510006, Peoples R China
Keywords
Speech separation; DCNN; U-Net; Multi-feature fusion
DOI
10.1007/s11760-025-03836-y
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
With the rapid development of computer technology, acquiring audio-visual signals in complex environments is no longer difficult, and using visual information to assist speech separation shows excellent potential. However, the problem of separating speech signals in audio-visual scenes with multiple speakers whose facial information is available has not been well solved. Because a speaker's lip movements are strongly correlated with the corresponding sound signal, this paper proposes a DCNN-U-Net speech separation model for audio-visual fusion, built on a dilated (atrous) convolutional neural network (DCNN) and U-Net. The model is trained on fused lip and audio signals so that it can better focus on the target speaker's audio, achieving visually aided speech separation. Experiments on the AVspeech dataset, evaluated with the PESQ, STOI, and SDR metrics, show that the DCNN-U-Net model achieves better audio-visual speech separation performance than the AV and DCNN-LSTM models.
Pages: 11
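The abstract specifies the architecture only at a high level: a dilated-convolution network (DCNN) combined with a U-Net, with lip features fused into the audio path. The PyTorch sketch below is not the authors' implementation; it is a minimal illustration, assuming a common mask-estimation formulation on magnitude spectrograms, of how such an audio-visual DCNN-U-Net could be wired. The class names, layer sizes, dilation schedule, and the lip-embedding dimension (lip_dim) are all hypothetical.

import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stack of dilated (atrous) 2-D convolutions: grows the receptive
    field over the spectrogram without any pooling."""
    def __init__(self, channels):
        super().__init__()
        layers = []
        for d in (1, 2, 4, 8):  # illustrative dilation schedule
            layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(channels),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class AVDCNNUNet(nn.Module):
    """U-Net encoder/decoder with a DCNN bottleneck; a per-clip lip
    embedding is broadcast over time-frequency and fused by a 1x1 conv."""
    def __init__(self, lip_dim=512, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, base, 3, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, 2 * base, 3, 2, 1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * base + lip_dim, 2 * base, 1)  # audio-visual fusion
        self.dcnn = DilatedBlock(2 * base)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(2 * base, base, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(2 * base, 1, 4, 2, 1)  # takes the skip connection

    def forward(self, spec, lip):
        # spec: (B, 1, F, T) magnitude spectrogram, F and T divisible by 4
        # lip:  (B, lip_dim) visual embedding of the target speaker's lips
        e1 = self.enc1(spec)
        e2 = self.enc2(e1)
        v = lip[:, :, None, None].expand(-1, -1, e2.shape[2], e2.shape[3])
        z = self.dcnn(self.fuse(torch.cat([e2, v], dim=1)))
        d2 = self.dec2(z)
        mask = torch.sigmoid(self.dec1(torch.cat([d2, e1], dim=1)))
        return mask * spec  # estimated target-speaker magnitude

A quick shape check under these assumptions: model = AVDCNNUNet(); out = model(torch.rand(2, 1, 256, 128), torch.rand(2, 512)) returns a (2, 1, 256, 128) masked spectrogram, which would then be combined with the mixture phase and inverted (e.g. by inverse STFT) to recover the waveform.

The three reported metrics all have standard open-source implementations. A small evaluation helper using the pesq, pystoi, and mir_eval packages (an assumption; the record does not say which implementations the authors used) might look like:

from pesq import pesq                              # ITU-T P.862 (package: pesq)
from pystoi import stoi                            # package: pystoi
from mir_eval.separation import bss_eval_sources   # package: mir_eval

def evaluate(ref, est, fs=16000):
    """PESQ / STOI / SDR for one reference-estimate pair of 1-D float
    NumPy waveforms at rate fs (PESQ expects real speech content)."""
    p = pesq(fs, ref, est, "wb")   # wide-band PESQ, defined for fs=16000
    s = stoi(ref, est, fs)         # short-time objective intelligibility
    sdr, _, _, _ = bss_eval_sources(ref[None, :], est[None, :])
    return p, s, sdr[0]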