Asynchronous Interaction Aggregation for Action Detection

Cited by: 59
Authors
Tang, Jiajun [1]
Xia, Jin [1]
Mu, Xinzhi [1]
Pang, Bo [1]
Lu, Cewu [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2020, PT XV | 2020 / Vol. 12360
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Action detection; Video understanding; Interaction; Memory;
DOI
10.1007/978-3-030-58555-6_5
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Understanding interaction is an essential part of video action detection. We propose the Asynchronous Interaction Aggregation network (AIA), which leverages different interactions to boost action detection. It has two key designs: the Interaction Aggregation structure (IA), which adopts a uniform paradigm to model and integrate multiple types of interaction, and the Asynchronous Memory Update algorithm (AMU), which achieves better performance by modeling very long-term interaction dynamically without huge computation cost. We provide empirical evidence that our network gains notable accuracy from the integrative interactions and is easy to train end-to-end. Our method reports new state-of-the-art performance on the AVA dataset, with a 3.7 mAP gain (12.6% relative improvement) on the validation split compared to our strong baseline. The results on the UCF101-24 and EPIC-Kitchens datasets further illustrate the effectiveness of our approach. Source code will be made public at: https://github.com/MVIG-SJTU/AlphAction.
Pages: 71-87
Page count: 17