MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

Cited by: 189
Authors
Zhou, Yizhou [1 ,2 ]
Sun, Xiaoyan [2 ]
Zha, Zheng-Jun [1 ]
Zeng, Wenjun [2 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
Source
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018
DOI
10.1109/CVPR.2018.00054
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human actions in videos are three-dimensional (3D) signals. Recent attempts use 3D convolutional neural networks (CNNs) to explore spatio-temporal information for human action recognition. Though promising, 3D CNNs have not achieved high performance on this task with respect to their well-established two-dimensional (2D) counterparts for visual recognition in still images. We argue that the high training complexity of spatio-temporal fusion and the huge memory cost of 3D convolution prevent current 3D CNNs, which stack 3D convolutions layer by layer, from outputting deeper feature maps that are crucial for high-level tasks. We thus propose a Mixed Convolutional Tube (MiCT) that integrates 2D CNNs with the 3D convolution module to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. A new end-to-end trainable deep 3D network, MiCT-Net, is also proposed based on the MiCT to better explore spatio-temporal information in human actions. Evaluations on three well-known benchmark datasets (UCF101, Sports-1M, and HMDB-51) show that the proposed MiCT-Net significantly outperforms the original 3D CNNs. Compared with state-of-the-art approaches for action recognition on UCF101 and HMDB-51, our MiCT-Net yields the best performance.
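The core idea of the abstract — coupling a 3D convolution with a 2D convolution inside one block so the 2D path deepens the features while the 3D path fuses them temporally — can be sketched roughly as follows. This is a minimal single-channel NumPy illustration of a cross-domain residual connection (one of the fusion schemes the MiCT concept suggests), not the authors' implementation; the valid-convolution boundaries and the temporal-alignment crop are assumptions made for the sketch.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2D cross-correlation: x is (H, W), w is (k, k)."""
    k = w.shape[0]
    H, W = x.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def conv3d(x, w):
    """Valid 3D cross-correlation: x is (T, H, W), w is (kt, k, k)."""
    kt, k, _ = w.shape
    T, H, W = x.shape
    out = np.zeros((T - kt + 1, H - k + 1, W - k + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(x[t:t + kt, i:i + k, j:j + k] * w)
    return out

def mict_block(x, w3d, w2d):
    """Hypothetical mixed tube: 3D conv plus a per-frame 2D conv, summed.

    The 2D path processes each frame independently (cheap, deep features);
    the 3D path fuses across time. Cropping the leading frames of the 2D
    path to match the valid 3D output is an assumption of this sketch.
    """
    y3d = conv3d(x, w3d)                              # (T-kt+1, H-k+1, W-k+1)
    y2d = np.stack([conv2d(frame, w2d) for frame in x])  # (T, H-k+1, W-k+1)
    kt = w3d.shape[0]
    return y3d + y2d[kt - 1:]                         # cross-domain residual
```

For example, on an all-ones clip `x` of shape `(4, 5, 5)` with all-ones kernels `(2, 3, 3)` and `(3, 3)`, the block yields a `(3, 3, 3)` output where each element is 18 (3D path) + 9 (2D path) = 27. A real network would interleave many such blocks with nonlinearities and learned weights, which is what makes the feature maps "deeper" at a fraction of the cost of stacking pure 3D convolutions.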
Pages: 449-458
Page count: 10