Exploiting spatio-temporal knowledge for video action recognition

Cited by: 3
Authors
Zhang, Huigang [1 ]
Wang, Liuan [1 ]
Sun, Jun [1 ]
Affiliations
[1] Fujitsu R&D Center, Beijing 100022, People's Republic of China
Keywords
action recognition; commonsense knowledge; GCN; STKM
DOI
10.1049/cvi2.12154
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition has been a popular area of computer vision research in recent years. The goal of this task is to recognise human actions in video frames. Most existing methods depend on visual features and the relationships among them within a video. Such features represent only the visual information of the current video itself and cannot capture general knowledge about particular actions beyond the video; the resulting deviations leave recognition performance with room for improvement. In this study, we present a novel spatio-temporal knowledge module (STKM) that endows current methods with commonsense knowledge. To this end, we first collect hybrid external knowledge from universal fields, containing both visual and semantic information. Graph convolutional networks (GCNs) are then used to represent and aggregate this knowledge. The GCNs involve (i) a spatial graph that captures spatial relations and (ii) a temporal graph that captures serial occurrence relations among actions. By integrating this knowledge with visual features, we obtain better recognition results. Experiments on the AVA, UCF101-24 and JHMDB datasets show the robustness and generalisation ability of STKM. The results report a new state-of-the-art 32.0 mAP on AVA v2.1. On the UCF101-24 and JHMDB datasets, our method also improves over the baseline by 1.5 AP and 2.6 AP, respectively.
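The graph aggregation the abstract describes rests on standard GCN propagation over a relation graph of action nodes. A minimal sketch of one such propagation step, assuming NumPy and purely illustrative node counts, feature dimensions, and adjacency (not the paper's actual spatial/temporal graphs or knowledge features):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: add self-loops, symmetrically
    normalise the adjacency, then propagate node features."""
    a_hat = adj + np.eye(adj.shape[0])               # self-loops
    d_inv_sqrt = np.diag(a_hat.sum(axis=1) ** -0.5)  # D^{-1/2}
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt         # D^{-1/2} A D^{-1/2}
    return np.maximum(a_norm @ feats @ weight, 0.0)  # ReLU activation

# Toy example: 4 action nodes with 8-d knowledge features,
# linked by a chain-shaped relation graph.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = rng.standard_normal((4, 8))
weight = rng.standard_normal((8, 16))

out = gcn_layer(adj, feats, weight)
print(out.shape)  # (4, 16)
```

In the paper's setting, one such graph would encode spatial relations and a second one serial (temporal) occurrence relations among actions, with their aggregated outputs fused with the video's visual features.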
Pages: 222-230 (9 pages)