FastPicker: Adaptive independent two-stage video-to-video summarization for efficient action recognition

Cited by: 11
Authors
Alfasly, Saghir [1 ,2 ]
Lu, Jian [1 ,3 ]
Xu, Chen [1 ,2 ]
Al-Huda, Zaid [4 ]
Jiang, Qingtang [5 ]
Lu, Zhaosong [6 ]
Chui, Charles K. [7 ]
Affiliations
[1] Shenzhen Univ, Coll Math & Stat, Shenzhen Key Lab Adv Machine Learning & Applicat, Shenzhen 518060, Peoples R China
[2] Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[3] Pazhou Lab, Guangzhou 510335, Peoples R China
[4] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligent, Chengdu 610031, Sichuan, Peoples R China
[5] Univ Missouri, Dept Math & Stat, St Louis, MO USA
[6] Univ Minnesota, Dept Ind & Syst Engn, Minneapolis, MN USA
[7] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
Funding
National Natural Science Foundation of China
Keywords
Action recognition; Video-to-video summarization; Discriminative frame selection; Representative frame selection; Deep learning; Network
DOI
10.1016/j.neucom.2022.10.037
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video datasets contain substantial inter-frame redundancy, which hinders effective learning in deep networks and increases computational cost. Several methods therefore adopt random/uniform frame sampling or key-frame selection techniques. Unfortunately, most learnable frame selection methods are customized for specific models and lack generality, independence, and scalability. In this paper, we propose a novel two-stage video-to-video summarization method termed FastPicker, which efficiently selects the most discriminative and representative frames for better action recognition. The two stages operate independently: discriminative frames are selected in the first stage based on inter-frame motion computation, whereas representative frames are selected in the second stage by a novel Transformer-based model. Learnable frame embeddings are proposed to estimate each frame's contribution to the final video classification certainty; the frames with the largest contributions are the most representative. The proposed method is evaluated by summarizing several action recognition datasets and using them to train various deep models with several backbones. The experimental results demonstrate a remarkable performance boost on Kinetics400, Something-Something-v2, ActivityNet-1.3, UCF-101, and HMDB51, e.g., FastPicker downsizes Kinetics400 by 78.7% of its size while improving human activity recognition.
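The record itself contains no code, so the following is only a minimal sketch, under assumptions, of how a two-stage frame selection pipeline of this kind could be structured; it is not the authors' implementation. The frame-differencing motion score, the `FrameScorer` module, the backbone feature extractor, and all dimensions are illustrative choices, not details taken from the paper.

```python
# Sketch (not the authors' code) of a two-stage frame selection pipeline
# in the spirit of FastPicker. All names, the frame-differencing motion
# score, and the scoring head are illustrative assumptions.
import torch
import torch.nn as nn


def select_discriminative(frames: torch.Tensor, k: int) -> torch.Tensor:
    """Stage 1: keep the k frames with the largest inter-frame motion.

    frames: (T, C, H, W) float tensor of a decoded clip. Motion is
    approximated here by the mean absolute difference between consecutive
    frames; the paper's exact motion measure may differ.
    """
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2, 3))  # (T-1,)
    motion = torch.zeros(frames.shape[0])
    motion[1:] = diffs                               # first frame scores 0
    idx = motion.topk(k).indices.sort().values       # keep temporal order
    return frames[idx]


class FrameScorer(nn.Module):
    """Stage 2 sketch: a Transformer encoder over per-frame features with
    learnable frame embeddings; each frame's encoded output is mapped to a
    scalar read as its contribution to the classification decision."""

    def __init__(self, feat_dim: int = 512, num_frames: int = 16):
        super().__init__()
        self.frame_embed = nn.Parameter(torch.randn(num_frames, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) per-frame features from any frozen backbone.
        x = feats + self.frame_embed[: feats.shape[1]]
        x = self.encoder(x)
        return self.score_head(x).squeeze(-1)         # (B, T) per-frame scores


# Hypothetical usage: keep the m highest-scoring frames as representatives.
# feats = backbone(stage1_frames).unsqueeze(0)   # (1, T, 512), backbone assumed
# scores = FrameScorer()(feats)                  # (1, T)
# rep_idx = scores[0].topk(m).indices.sort().values
```

In this reading, stage 1 needs no training at all, while stage 2 is trained once and can then summarize videos for any downstream recognition model, which is consistent with the model-independence the abstract emphasizes.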
Pages: 231-244
Number of pages: 14