Video Action Transformer Network

Cited by: 541
Authors
Girdhar, Rohit [1, 2]
Carreira, Joao [2]
Doersch, Carl [2]
Zisserman, Andrew [2, 3]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] DeepMind, London, England
[3] Univ Oxford, Oxford, England
Source
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) | 2019
DOI
10.1109/CVPR.2019.00033
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state of the art by a significant margin using only raw RGB frames as input.
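As a rough illustration of the query-based aggregation the abstract describes, the following is a minimal sketch of single-query scaled dot-product attention in plain Python. It is not the authors' implementation: the function names (`softmax`, `attend`), the toy feature vectors, and the simplification to a single head with no learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    query:  list of d floats (hypothetical person-specific feature)
    keys:   list of N vectors of d floats (one per spatiotemporal location)
    values: list of N vectors of d_v floats (one per spatiotemporal location)
    Returns (context_vector, attention_weights).
    """
    d = len(query)
    # Similarity of the person query to every context location.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(d_v)
    ]
    return context, weights


# Toy example with made-up numbers: 3 context locations, 2-d features.
person_query = [1.0, 0.0]
context_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
context_values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
context, weights = attend(person_query, context_keys, context_values)
```

In this toy setup the location whose key best matches the person query receives the largest weight, so the aggregated context vector is dominated by that location's value. A full model along the lines of the abstract would presumably add learned projections and multiple heads and layers, none of which are shown here.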
Pages: 244-253
Number of pages: 10