Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing

Cited by: 1
Authors
Sun, Xin [1 ]
Wang, Xuan [2 ]
Liu, Qiong [3 ]
Zhou, Xi [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Cooperat Medianet Innovat Ctr, Shanghai 200240, Peoples R China
[2] Shanghai Univ Int Business & Econ, Sch Management, Shanghai 201620, Peoples R China
[3] CloudWalk Technol, Shanghai 201203, Peoples R China
Keywords
Visualization; Proposals; Training; Task analysis; Feature extraction; Noise; Self-supervised learning; Multi-modal signal processing; weakly supervised learning; multi-level signal fusion;
DOI
10.1109/LSP.2024.3388957
CLC Number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Number
0808; 0809;
Abstract
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse a video into temporal events and predict their modality-specific categories. Current works focus primarily on refining training strategies and follow a framework that fuses signals only at the segment level. However, they overlook the fact that video events, being composed of consecutive segments, require the integration of both local and global context to be fully captured. In this letter, we present the Local-Global Fusion Network (LGFNet), designed to facilitate multi-level interaction between audio and visual signals. Specifically, we create a two-dimensional map to generate multi-scale event proposals for both the audio and visual modalities. We then fuse audio and visual signals at both the segment and event levels with a novel boundary-aware feature aggregation method, enabling the simultaneous capture of local and global information. To enhance the temporal alignment between the two modalities, we employ segment-level and event-level contrastive learning. In-depth experiments demonstrate the superiority of our LGFNet.
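The letter itself provides no code, but the two-dimensional proposal map described in the abstract can be pictured with a minimal sketch. The idea, under our assumptions, is that entry (i, j) of a T x T map represents a candidate event spanning segments i through j, with its feature obtained by aggregating the segment features in that span (mean pooling is used here as one simple choice; the paper's boundary-aware aggregation is more elaborate). All function and variable names below are illustrative, not from the paper.

```python
import numpy as np

def build_proposal_map(seg_feats, max_len=None):
    """Build a 2D map of multi-scale event proposals.

    seg_feats: (T, D) array of per-segment features (audio or visual).
    Entry (i, j) with i <= j holds a candidate event covering
    segments i..j, represented by the mean of its segment features.
    Returns the (T, T, D) map and a (T, T) boolean validity mask.
    """
    T, D = seg_feats.shape
    max_len = max_len or T
    pmap = np.zeros((T, T, D), dtype=seg_feats.dtype)
    valid = np.zeros((T, T), dtype=bool)
    # Prefix sums give O(1) averaging over any span i..j.
    csum = np.vstack([np.zeros((1, D)), np.cumsum(seg_feats, axis=0)])
    for i in range(T):
        for j in range(i, min(T, i + max_len)):
            pmap[i, j] = (csum[j + 1] - csum[i]) / (j - i + 1)
            valid[i, j] = True
    return pmap, valid

# Example: 10 one-second segments with 4-dim features.
feats = np.random.randn(10, 4).astype(np.float32)
pmap, valid = build_proposal_map(feats)
print(pmap.shape)   # (10, 10, 4)
print(valid.sum())  # 55 valid proposals for T = 10
```

With T = 10 and no length cap, the upper triangle yields 10 + 9 + ... + 1 = 55 candidate events, covering every possible contiguous span; in practice a `max_len` cap keeps the proposal set tractable for long videos.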
Pages: 1149-1153
Number of Pages: 5