Uncertainty-Aware Dual-Evidential Learning for Weakly-Supervised Temporal Action Localization

Cited: 10
Authors
Chen, Mengyuan [1 ,2 ]
Gao, Junyu [1 ,2 ]
Xu, Changsheng [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101408, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
Weakly-supervised temporal action localization; evidential deep learning; uncertainty estimation; attention
DOI
10.1109/TPAMI.2023.3308571
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Weakly-supervised temporal action localization (WTAL) aims to localize action instances and recognize their categories using only video-level labels. Despite great progress, existing methods suffer from severe action-background ambiguity, which mainly arises from background noise and from neglecting non-salient action snippets. To address this issue, we propose a generalized evidential deep learning (EDL) framework for WTAL, called Uncertainty-aware Dual-Evidential Learning (UDEL), which extends the traditional EDL paradigm to the weakly-supervised multi-label classification setting under the guidance of epistemic and aleatoric uncertainties: the former stems from the model's lack of knowledge, while the latter stems from the inherent properties of the samples themselves. Specifically, to exclude undesirable background snippets, we fuse the video-level epistemic and aleatoric uncertainties to measure how background noise interferes with the video-level prediction. The snippet-level aleatoric uncertainty is then derived for progressive mutual learning, which gradually attends to entire action instances in an "easy-to-hard" manner and encourages the snippet-level epistemic uncertainty to be complementary to the foreground attention scores. Extensive experiments show that UDEL achieves state-of-the-art performance on four public benchmarks. Our code is available at github/mengyuanchen2021/UDEL.
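The abstract's core mechanism is Dirichlet-based evidential deep learning, from which epistemic and aleatoric uncertainties are derived at the video and snippet level. The following is a minimal, hedged sketch of that standard EDL recipe (evidence, Dirichlet parameters, vacuity and entropy uncertainties) following the common Sensoy et al. formulation, not the authors' exact UDEL losses; the function edl_uncertainties and the toy tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def edl_uncertainties(logits):
    # Non-negative evidence per class, as in standard EDL.
    evidence = F.softplus(logits)
    # Dirichlet concentration parameters: alpha = evidence + 1.
    alpha = evidence + 1.0
    # Dirichlet strength S = sum_k alpha_k.
    strength = alpha.sum(dim=-1, keepdim=True)
    # Expected class probabilities under the Dirichlet.
    probs = alpha / strength
    num_classes = logits.shape[-1]
    # Epistemic (vacuity) uncertainty K / S: large when total evidence is low.
    epistemic = num_classes / strength.squeeze(-1)
    # Aleatoric uncertainty: entropy of the expected categorical distribution.
    aleatoric = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return probs, epistemic, aleatoric

# Toy usage: T snippets from one video, K action classes.
T, K = 8, 20
probs, epi_u, ale_u = edl_uncertainties(torch.randn(T, K))
print(probs.shape, epi_u.shape, ale_u.shape)  # (8, 20), (8,), (8,)

In this sketch the epistemic term K/S shrinks as total evidence grows, while the entropy-based aleatoric term reflects ambiguity inherent in the snippet itself; UDEL's actual video-level fusion and progressive mutual-learning objectives are described in the full paper.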
Pages: 15896-15911
Page count: 16