Spatio-temporal Channel Correlation Networks for Action Classification

Cited by: 135
Authors
Diba, Ali [1, 4]
Fayyaz, Mohsen [2 ]
Sharma, Vivek [3 ]
Arzani, M. Mahdi [4 ]
Yousefzadeh, Rahman [4 ]
Gall, Juergen [2 ]
Van Gool, Luc [1, 4]
Affiliations
[1] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium
[2] Univ Bonn, Bonn, Germany
[3] KIT, CV HCI, Karlsruhe, Germany
[4] Sensifai, Brussels, Belgium
Source
COMPUTER VISION - ECCV 2018, PT IV | 2018 / Vol. 11208
Funding
European Research Council;
Keywords
RECOGNITION; HISTOGRAMS;
DOI
10.1007/978-3-030-01225-0_18
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This work is driven by the question of whether spatio-temporal correlations are enough for 3D convolutional neural networks (CNNs). Most traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between the channels of a 3D CNN with respect to temporal and spatial features. This new block can be added as a residual unit to different parts of 3D CNNs. We name our novel block 'Spatio-Temporal Channel Correlation' (STC). By embedding this block into current state-of-the-art architectures such as ResNext and ResNet, we improve performance by 2-3% on the Kinetics dataset. Our experiments show that adding STC blocks to current state-of-the-art architectures outperforms the state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D CNNs is that they must be trained from scratch on a huge labeled dataset to reach reasonable performance, so the knowledge learned in 2D CNNs is completely ignored. Another contribution of this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by fine-tuning this network, we beat the performance of generic and recent 3D CNN methods that were trained on large video datasets, e.g. Sports-1M, and fine-tuned on the target datasets, e.g. HMDB51/UCF101.
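The abstract says only that the STC block models channel correlations with respect to temporal and spatial features and can be attached as a residual unit. It does not give the block's internals. The sketch below is therefore an illustration under stated assumptions, not the authors' implementation. It assumes squeeze-and-excitation-style gating with two branches: one pools each channel over time and space, the other pools over space only and keeps the time axis. Every name (`stc_like_block`, `channel_gates`, `make_params`, the bottleneck size) is hypothetical, and plain nested lists stand in for tensors so the example runs without third-party libraries.

```python
import math
import random


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(vec, weights):
    """Bias-free fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_gates(descriptor, w_reduce, w_expand):
    """Bottleneck MLP (reduce -> ReLU -> expand -> sigmoid): one gate per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w_reduce)]
    return [sigmoid(g) for g in dense(hidden, w_expand)]


def stc_like_block(x, params):
    """Assumed design. x is indexed x[c][t][h][w]; the output has the same shape."""
    # Branch A: pool each channel over time and space -> C values.
    pooled_all = [mean([v for frame in ch for row in frame for v in row]) for ch in x]
    # Branch B: pool each channel over space only, keeping time -> C*T values.
    pooled_space = [mean([v for row in frame for v in row]) for ch in x for frame in ch]
    gates_a = channel_gates(pooled_all, params["a_reduce"], params["a_expand"])
    gates_b = channel_gates(pooled_space, params["b_reduce"], params["b_expand"])
    # Residual combination: the input plus both channel-rescaled copies.
    return [
        [[[v + ga * v + gb * v for v in row] for row in frame] for frame in ch]
        for ch, ga, gb in zip(x, gates_a, gates_b)
    ]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_params(channels, frames, bottleneck, rng):
    """Randomly initialised weights for both branches (illustrative only)."""
    return {
        "a_reduce": random_matrix(bottleneck, channels, rng),
        "a_expand": random_matrix(channels, bottleneck, rng),
        "b_reduce": random_matrix(bottleneck, channels * frames, rng),
        "b_expand": random_matrix(channels, bottleneck, rng),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    C, T, H, W = 4, 3, 2, 2
    x = [[[[rng.random() for _ in range(W)] for _ in range(H)]
          for _ in range(T)] for _ in range(C)]
    y = stc_like_block(x, make_params(C, T, 2, rng))
    print(len(y), len(y[0]), len(y[0][0]), len(y[0][0][0]))
```

Because the block preserves the input shape and adds its output to the input, a unit of this kind could in principle follow any stage of a 3D ResNet-style network, which is consistent with the abstract's statement that the block can be added to different parts of 3D CNNs. The sketch does not cover how the gates are trained or how the branches are combined in the published model.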
Pages: 299-315
Page count: 17