Adversarial Action Prediction Networks

Cited: 41
Authors
Kong, Yu [1,2]
Tao, Zhiqiang [2]
Fu, Yun [2,3]
Affiliations
[1] Rochester Inst Technol, B Thomas Golisano Coll Comp & Informat Sci, Rochester, NY 14623 USA
[2] Northeastern Univ, Dept ECE, Boston, MA 02115 USA
[3] Northeastern Univ, Coll CIS, Boston, MA 02115 USA
Keywords
Action prediction; action recognition; sequential context; variational autoencoder; adversarial learning
DOI
10.1109/TPAMI.2018.2882805
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Unlike after-the-fact action recognition, the action prediction task requires action labels to be predicted from partially observed videos containing incomplete action executions. This is challenging because partial videos carry insufficient discriminative information and their temporal structure is damaged. We study this problem and propose an efficient, powerful deep network that learns representative and discriminative features for action prediction. Our approach exploits the abundant sequential context in full videos to enrich the feature representations of partial videos. This information is encoded into latent representations by a variational autoencoder (VAE), and these representations are encouraged to be progress-invariant. Decoding the latent representations with another VAE reconstructs the information missing from the features extracted from partial videos. An adversarial learning scheme differentiates the reconstructed features from features extracted directly from full videos, so that their distributions are well aligned, and a multi-class classifier encourages the features to be discriminative. Our network jointly learns features and classifiers, generating features optimized specifically for action prediction. Extensive experiments on the UCF101, Sports-1M, and BIT datasets demonstrate that our approach markedly outperforms state-of-the-art methods while running significantly faster. The results also show that actions differ in their prediction characteristics: some actions can be predicted correctly even when only the first 10% of a video is observed.
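The abstract combines three objectives: reconstructing full-video features from partial-video features, an adversarial term that aligns the reconstructed and real full-video feature distributions, and a classification term that keeps the features discriminative. The toy sketch below illustrates how those three loss terms compose. It is not the authors' implementation: the paper's VAEs and deep discriminator are replaced here with single linear maps on random data, and all names (`W_enc`, `W_dec`, `w_disc`, `W_cls`) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: D-dim video features, Z-dim latent, K action classes, N samples.
D, Z, K, N = 16, 8, 4, 32

# Hypothetical stand-ins for the paper's components (simplified to single
# linear maps; the paper uses VAEs and a learned discriminator network).
W_enc = rng.normal(0, 0.1, (Z, D))   # encoder: partial features -> latent
W_dec = rng.normal(0, 0.1, (D, Z))   # decoder: latent -> "completed" features
w_disc = rng.normal(0, 0.1, D)       # discriminator: real vs. reconstructed
W_cls = rng.normal(0, 0.1, (K, D))   # multi-class action classifier

x_partial = rng.normal(0, 1, (N, D))  # features of partially observed videos
x_full = rng.normal(0, 1, (N, D))     # features of full videos
y = rng.integers(0, K, N)             # ground-truth action labels

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

eps = 1e-8

# 1) Reconstruct full-video features from partial-video features.
x_rec = x_partial @ W_enc.T @ W_dec.T            # N x D

# 2) Adversarial alignment: the discriminator scores full features as real (1)
#    and reconstructed features as fake (0); the generator side is rewarded
#    when reconstructions are scored as real.
p_real = sigmoid(x_full @ w_disc)
p_fake = sigmoid(x_rec @ w_disc)
loss_disc = -np.mean(np.log(p_real + eps) + np.log(1 - p_fake + eps))
loss_gen = -np.mean(np.log(p_fake + eps))

# 3) Discriminative term: cross-entropy of the action classifier on the
#    reconstructed (completed) features.
probs = softmax(x_rec @ W_cls.T)
loss_cls = -np.mean(np.log(probs[np.arange(N), y] + eps))

# Generator-side objective (sketch): fool the discriminator while staying
# classifiable. In training these terms would be minimized alternately with
# loss_disc, GAN-style.
total_loss = loss_gen + loss_cls
```

In the full method these terms would be optimized adversarially (alternating discriminator and generator updates) with gradient descent; the sketch only evaluates them once to show how the pieces fit together.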
Pages: 539-553 (15 pages)