Fastformer: Transformer-Based Fast Reasoning Framework

Citations: 0
Authors
Zhu, Wenjuan [1]
Guo, Ling [1]
Zhang, Tianxiang [1]
Han, Feng [1]
Wei, Yi [1]
Gong, Xiaoqing [1]
Xu, Pengfei [1]
Guo, Jing [1]
Affiliations
[1] Northwest Univ, Kirkland, WA 98033 USA
Source
FOURTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING, ICGIP 2022 | 2022 / Vol. 12705
Funding
National Natural Science Foundation of China
Keywords
Action recognition; highway network; self-attention; transformer
DOI
10.1117/12.2680430
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video action recognition is a vital task in computer vision. A great deal of redundant information is generated alongside the original video data in the process of depth computation. To address this problem, most existing methods improve recognition speed at the cost of recognition accuracy. In this paper, we propose Fastformer, a transformer-based framework for fast-inference video classification that further improves inference speed while maintaining accuracy. To balance speed and accuracy, we address the inter-frame and intra-frame redundancy of video and design a new self-attention network that uses an improved highway network to perform the same function as the traditional self-attention module while greatly reducing the amount of computation and the number of required parameters. We conduct experiments to verify the effectiveness of our model. Overall, Fastformer significantly outperforms existing vision transformers in the speed-accuracy trade-off. For example, at 76.4% Kinetics-400 accuracy, Fastformer is 28% faster than TimeSformer.
Pages: 10