Multi-region Two-Stream R-CNN for Action Detection

Cited by: 243
Authors
Peng, Xiaojiang [1]
Schmid, Cordelia [1]
Affiliations
[1] Inria, Thoth Team, Lab Jean Kuntzmann, Grenoble, France
Source
COMPUTER VISION - ECCV 2016, PT IV | 2016 / Vol. 9908
Keywords
Action detection; Faster R-CNN; Multi-region CNNs; Two stream R-CNN; ACTION RECOGNITION; LOCALIZATION;
DOI
10.1007/978-3-319-46493-0_45
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on Faster R-CNN and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals that are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the Faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art by a significant margin in both frame-mAP and video-mAP.
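The temporal localization step named in the abstract relies on the maximum subarray method. The Python sketch below is a generic illustration of that technique (Kadane's algorithm), not the authors' implementation. It assumes, hypothetically, that each frame of a linked detection track already carries a score that is positive for likely action frames and negative otherwise; the function name `max_subarray` and the score convention are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple


def max_subarray(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is inclusive. `scores` is assumed to hold one value per frame,
    already offset so that non-action frames tend to be negative.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    best_sum = cur_sum = scores[0]
    best_start = best_end = cur_start = 0

    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt: restart the run here.
            cur_sum = scores[i]
            cur_start = i
        else:
            cur_sum += scores[i]

        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i

    return best_start, best_end, best_sum
```

Under these assumptions, calling `max_subarray` on the per-frame scores of one track yields the first and last frame index of the highest-scoring temporal segment in a single linear pass.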
Pages: 744-759
Page count: 16
Related papers
44 in total
[1]   Human Activity Analysis: A Review [J].
Aggarwal, J. K. ;
Ryoo, M. S. .
ACM COMPUTING SURVEYS, 2011, 43 (03)
[2]  
[Anonymous], PROC CVPR IEEE
[3]  
[Anonymous], P INT C NEUR INF PRO
[4]  
[Anonymous], 2015, Advances in Neural Information Processing Systems, DOI 10.1109/TPAMI.2016.2577031
[5]  
[Anonymous], 2012, CoRR
[6]  
Bentley J., 1984, Communications of the ACM, V27, P865, DOI 10.1145/358234.381162
[7]   Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations [J].
Bourdev, Lubomir ;
Malik, Jitendra .
2009 IEEE 12TH INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2009, :1365-1372
[8]   High accuracy optical flow estimation based on a theory for warping [J].
Brox, T ;
Bruhn, A ;
Papenberg, N ;
Weickert, J .
COMPUTER VISION - ECCV 2004, PT 4, 2004, 2034 :25-36
[9]   P-CNN: Pose-based CNN Features for Action Recognition [J].
Cheron, Guilhem ;
Laptev, Ivan ;
Schmid, Cordelia .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :3218-3226
[10]  
Dai JF, 2015, PROC CVPR IEEE, P3992, DOI 10.1109/CVPR.2015.7299025