Slowfast Diversity-aware Prototype Learning for Egocentric Action Recognition

Cited by: 2
Authors
Dai, Guangzhao [1 ]
Shu, Xiangbo [1 ]
Yan, Rui [2 ]
Huang, Peng [1 ]
Tang, Jinhui [1 ]
Affiliations
[1] Nanjing University of Science and Technology, Nanjing, People's Republic of China
[2] Nanjing University, Nanjing, People's Republic of China
Source
Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Egocentric Action Recognition; Prototype Learning; Video Understanding;
DOI
10.1145/3581783.3612144
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Egocentric Action Recognition (EAR) requires recognizing both the interacting objects (noun) and the motion (verb) against cluttered backgrounds with distracting objects. For capturing interacting objects, traditional approaches rely heavily on costly object annotations or detectors, while a few works heuristically enumerate fixed sets of verb-constrained prototypes to roughly exclude the background. For capturing motion, the inherent variation of motion duration among egocentric videos of different lengths is largely ignored. To this end, we propose a novel Slowfast Diversity-aware Prototype learning (SDP) framework that effectively captures interacting objects by learning compact yet diverse prototypes, and adaptively captures motion in either long-time or short-time videos. Specifically, we present a new Part-to-Prototype (P2P) scheme that learns prototypes covering the interacting objects directly from raw videos by refining semantic information from the part level to the prototype level. Moreover, for adaptively capturing motion, we design a new Slow-Fast Context (SFC) mechanism that explores Up/Down augmentations of the prototype representations at the semantic level to strengthen transient dynamic information in short-time videos and eliminate redundant dynamic information in long-time videos; the two streams are further fine-complemented via slow- and fast-aware attentions. Extensive experiments demonstrate that SDP outperforms state-of-the-art methods on two large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 and EGTEA.
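To make the Slow-Fast Context idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of that idea only, not the authors' code: the module and parameter names (SlowFastContext, down_factor, up_factor) are assumptions. The "Down" path pools the prototype sequence to suppress redundant dynamics in long-time videos, the "Up" path interpolates it to emphasize transient dynamics in short-time videos, and both streams are fused back into the prototypes via separate slow- and fast-aware attentions.

# Illustrative sketch only; all names are assumptions, not the published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlowFastContext(nn.Module):
    """Augment prototype tokens at two temporal rates and fuse them with attention."""
    def __init__(self, dim: int, num_heads: int = 4, down_factor: int = 2, up_factor: int = 2):
        super().__init__()
        self.down_factor = down_factor   # "Down" path: remove redundant dynamics (long-time videos)
        self.up_factor = up_factor       # "Up" path: strengthen transient dynamics (short-time videos)
        self.slow_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fast_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, protos: torch.Tensor) -> torch.Tensor:
        # protos: (B, T, D) prototype representations over T temporal segments
        x = protos.transpose(1, 2)                                    # (B, D, T)
        slow = F.avg_pool1d(x, self.down_factor, self.down_factor)    # temporally downsampled stream
        fast = F.interpolate(x, scale_factor=self.up_factor,
                             mode="linear", align_corners=False)      # temporally upsampled stream
        slow, fast = slow.transpose(1, 2), fast.transpose(1, 2)       # back to (B, T', D)
        # Slow-/fast-aware attention: the original prototypes query each augmented stream.
        slow_ctx, _ = self.slow_attn(protos, slow, slow)
        fast_ctx, _ = self.fast_attn(protos, fast, fast)
        return protos + self.fuse(torch.cat([slow_ctx, fast_ctx], dim=-1))

if __name__ == "__main__":
    sfc = SlowFastContext(dim=256)
    out = sfc(torch.randn(2, 8, 256))   # 2 clips, 8 prototype tokens, 256-d features
    print(out.shape)                    # torch.Size([2, 8, 256])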
Pages: 7549-7558
Number of pages: 10