End-to-end Video-level Representation Learning for Action Recognition

Cited by: 0
Authors
Zhu, Jiagang [1 ,2 ]
Zhu, Zheng [1 ,2 ]
Zou, Wei [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] CASIA Co Ltd, TianJin Intelligent Tech Inst, Beijing, Peoples R China
Source
2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2018
Funding
National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program);
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
From frame/clip-level feature learning to video-level representation building, deep learning methods for action recognition have developed rapidly in recent years. However, current methods suffer from the confusion caused by training on partial observations, lack end-to-end learning, or are restricted to modeling a single temporal scale. In this paper, we build upon two-stream ConvNets and propose Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address these problems. Specifically, RGB images and optical flow stacks are first sparsely sampled across the whole video. A temporal pyramid pooling layer then aggregates the frame-level features, which carry both spatial and temporal cues. The trained model thus produces a compact video-level representation over multiple temporal scales that is both global and sequence-aware. Experimental results show that DTPP achieves state-of-the-art performance on two challenging video action datasets, UCF101 and HMDB51, with either ImageNet pre-training or Kinetics pre-training.
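The aggregation step described in the abstract can be illustrated with a minimal sketch. The pyramid levels (1, 2, 4), the choice of max pooling, and the function name temporal_pyramid_pool below are illustrative assumptions, not settings taken from the paper; frame-level feature extraction by the two-stream ConvNets is omitted.

```python
import torch

def temporal_pyramid_pool(frame_feats: torch.Tensor,
                          levels=(1, 2, 4)) -> torch.Tensor:
    """Pool frame-level features of shape (T, D) into a fixed-length
    video-level vector by max-pooling over segments at several
    temporal scales.

    Assumption: the pyramid levels and max pooling are illustrative
    choices, not the paper's exact configuration.
    """
    T, _ = frame_feats.shape
    pooled = []
    for k in levels:
        # Split the T sampled frames into k roughly equal temporal segments.
        bounds = torch.linspace(0, T, k + 1).long()
        for i in range(k):
            lo, hi = bounds[i].item(), bounds[i + 1].item()
            segment = frame_feats[lo:max(lo + 1, hi)]  # at least one frame
            pooled.append(segment.max(dim=0).values)
    # Concatenating all segment descriptors gives a D * sum(levels) vector:
    # level 1 is a global summary, finer levels keep temporal order,
    # so the result is both global and sequence-aware.
    return torch.cat(pooled, dim=0)

# Example: 25 sparsely sampled frames with 1024-d features yield a
# 1024 * (1 + 2 + 4) = 7168-d video-level representation.
video_vec = temporal_pyramid_pool(torch.randn(25, 1024))
assert video_vec.shape == (1024 * 7,)
```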
Pages: 645-650
Page count: 6