Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception

Cited: 25
Authors
Gao, Junyu [1 ,2 ]
Chen, Mengyuan [1 ,2 ]
Xu, Changsheng [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci CASIA, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
DOI
10.1109/CVPR52729.2023.01805
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
With only video-level event labels, this paper targets the task of weakly-supervised audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize the events belonging to each modality. Despite recent progress, most existing approaches either ignore the unsynchronized nature of audio-visual tracks or fail to exploit the complementary modality for explicit enhancement. We argue that, for an event residing in one modality, the modality itself should provide ample presence evidence of the event, while the complementary modality should afford absence evidence as a reference signal. To this end, we propose to collect Cross-Modal Presence-Absence Evidence (CMPAE) in a unified framework. Specifically, by leveraging uni-modal and cross-modal representations, a presence-absence evidence collector (PAEC) is designed under Subjective Logic theory. To keep the evidence in a reliable range, we propose a joint-modal mutual learning (JML) process, which adaptively and dynamically calibrates the evidence of diverse audible, visible, and audio-visible events. Extensive experiments show that our method surpasses state-of-the-art approaches (e.g., absolute gains of 3.6% and 6.1% on event-level visual and audio metrics). Code is available at github.com/MengyuanChen21/CVPR2023-CMPAE.
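For context, below is a minimal sketch of the Subjective Logic evidence-to-opinion mapping that evidential collectors such as the PAEC build on: non-negative evidence parameterizes a Dirichlet distribution, from which per-class belief and an overall uncertainty (vacuity) are derived. The function name, the softplus activation, and the toy shapes are illustrative assumptions, not the paper's actual design; the authors' implementation is in the repository linked above.

```python
# Sketch of the Subjective Logic evidence -> (belief, uncertainty) mapping.
# Names and shapes are hypothetical; not the authors' CMPAE code.
import torch
import torch.nn.functional as F

def evidence_to_opinion(logits: torch.Tensor):
    """Map raw logits to Subjective Logic belief masses and vacuity.

    With non-negative evidence e_k, the Dirichlet parameters are
    alpha_k = e_k + 1, the Dirichlet strength is S = sum_k alpha_k,
    per-class belief is b_k = e_k / S, and uncertainty is u = K / S,
    so that sum_k b_k + u = 1.
    """
    evidence = F.softplus(logits)                 # non-negative evidence per class
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)    # S = sum of alphas
    belief = evidence / strength                  # per-class belief mass
    uncertainty = logits.shape[-1] / strength     # vacuity u = K / S
    return belief, uncertainty

# Toy usage: presence evidence from one modality; the vacuity is the kind of
# quantity a complementary modality could consult as a reference signal.
logits = torch.randn(2, 25)                       # (batch, num_event_classes)
belief, u = evidence_to_opinion(logits)
assert torch.allclose(belief.sum(-1, keepdim=True) + u, torch.ones(2, 1))
```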
Pages: 18827-18836 (10 pages)