Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection

Cited by: 2
Authors
Suzuki, Tomoyuki [1 ]
Aoki, Yoshimitsu [1 ]
Affiliations
[1] Keio Univ, Fac Sci & Technol, Dept Elect & Elect Engn, 3-14-1 Hiyoshi,Kohoku Ku, Yokohama, Kanagawa 2238522, Japan
Keywords
video recognition; action recognition; transformer; compressed video
DOI
10.3390/s23010244
CLC Classification Number
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, codecs such as MPEG-4 reduce the redundancy of videos by treating small motions and residuals as less informative and assigning them short code lengths. Inspired by this idea, we propose Informative Patch Selection (IPS), which efficiently reduces inference cost by excluding redundant patches from the input of a Transformer-based video model. The redundancy of each patch is computed from the motions and residuals obtained while decoding the compressed video. The method is simple and effective in that it dynamically reduces the inference cost depending on the input, without any policy model or additional loss term. Extensive experiments on action recognition demonstrate that our method significantly improves the trade-off between accuracy and inference cost for Transformer-based video models. Although it requires no policy model or additional loss term, its performance approaches that of existing methods that do require them.
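The abstract describes scoring each patch's redundancy from the motion vectors and residuals that are already available while decoding the compressed bitstream, and dropping the most redundant patches before they reach the Transformer. The following Python sketch illustrates one plausible form of such a selection step; the scoring rule (mean absolute motion plus mean absolute residual), the keep ratio, and the function name select_informative_patches are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def select_informative_patches(motion, residual, patch_size=16, keep_ratio=0.5):
    # Score each non-overlapping patch by the magnitude of its motion vectors
    # and decoding residuals, then keep the highest-scoring fraction.
    # motion:   (H, W, 2) per-pixel motion vectors for one frame
    # residual: (H, W, C) decoding residual for the same frame
    # Returns the flat indices of the kept patches, highest score first.
    H, W = motion.shape[:2]
    nh, nw = H // patch_size, W // patch_size
    scores = np.empty(nh * nw)
    for i in range(nh):
        for j in range(nw):
            ys, xs = i * patch_size, j * patch_size
            m = motion[ys:ys + patch_size, xs:xs + patch_size]
            r = residual[ys:ys + patch_size, xs:xs + patch_size]
            # Illustrative redundancy score: larger motion or residual
            # magnitude suggests the patch carries more information.
            scores[i * nw + j] = np.abs(m).mean() + np.abs(r).mean()
    k = max(1, int(keep_ratio * scores.size))
    return np.argsort(scores)[::-1][:k]

# Example: keep the top 50% of 16x16 patches in a 224x224 frame.
motion = np.random.randn(224, 224, 2)
residual = np.random.randn(224, 224, 3)
kept = select_informative_patches(motion, residual)
print(kept.shape)  # (98,) -> 98 of 196 patches retained

In this reading, the kept indices would gate which patch tokens are fed to the Transformer, so the per-input token count (and hence the inference cost) shrinks without any learned policy network.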
Pages: 23