LAE-Net: Light and Efficient Network for Compressed Video Action Recognition

Cited by: 2
Authors
Guo, Jinxin [1]
Zhang, Jiaqiang [1]
Zhang, Xiaojing [1]
Ma, Ming [1]
Affiliations
[1] Inner Mongolia University, Hohhot, People's Republic of China
Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023 / Vol. 13834
Keywords
Action recognition; Compressed video; Transfer learning;
DOI
10.1007/978-3-031-27818-1_22
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition is a crucial task in computer vision and video analysis. The two-stream network and 3D ConvNets are representative approaches. Although both achieve outstanding performance, optical flow and 3D convolution require huge computational effort and do not take into account the needs of real-time applications. Recent work extracts motion vectors and residuals directly from the compressed video to replace optical flow. However, because motion vectors are a noisy and inaccurate representation of motion, model accuracy decreases significantly when they are used as input. In addition, existing works focus only on improving accuracy or on reducing computational cost, without exploring the trade-off between the two. In this paper, we propose a light and efficient multi-stream framework that includes a motion temporal fusion module (MTFM) and a double compressed knowledge distillation module (DCKD). MTFM improves the network's ability to extract complete motion information and compensates, to some extent, for the inaccurate description of motion by motion vectors in compressed video. DCKD allows the student network to gain more knowledge from the teacher with fewer parameters and input frames. Experimental results on two public benchmarks (UCF-101 and HMDB-51) show that the proposed method outperforms the state of the art in the compressed domain.
Pages: 265-276
Number of pages: 12