Expanding Language-Image Pretrained Models for General Video Recognition

Cited by: 157
Authors
Ni, Bolin [1 ,2 ]
Peng, Houwen [4 ]
Chen, Minghao
Zhang, Songyang [6 ]
Meng, Gaofeng [1 ,2 ,3 ]
Fu, Jianlong [4 ]
Xiang, Shiming [1 ,2 ]
Ling, Haibin [5 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Chinese Acad Sci, HK Inst Sci & Innovat, CAIR, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
[6] Univ Rochester, Rochester, NY USA
Source
COMPUTER VISION - ECCV 2022, PT IV | 2022 / Vol. 13664
Funding
National Natural Science Foundation of China;
Keywords
Video recognition; Contrastive language-image pretraining;
DOI
10.1007/978-3-031-19772-7_1
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme that leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400 while using 12x fewer FLOPs than Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when labeled data is extremely limited. Code and models are publicly available.
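
The cross-frame attention idea described in the abstract can be illustrated with a minimal PyTorch-style sketch. The class name CrossFrameAttention and its interface are hypothetical, chosen only to show the general mechanism (per-frame embeddings from a pretrained image encoder exchanging information through a lightweight residual attention layer); this is not the authors' exact implementation.

import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    # Illustrative (hypothetical) cross-frame attention: each frame's
    # embedding attends to the embeddings of all other frames, so temporal
    # information is exchanged on top of a pretrained image encoder.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, dim), one embedding per frame,
        # e.g. produced by a frozen CLIP-like image encoder.
        x = self.norm(frame_tokens)
        msg, _ = self.attn(x, x, x)    # frames exchange information
        return frame_tokens + msg      # residual keeps pretrained features intact

# Example: 8 frames of 512-d features for a batch of 2 clips.
feats = torch.randn(2, 8, 512)
out = CrossFrameAttention(512)(feats)
print(out.shape)  # torch.Size([2, 8, 512])

Because the block is purely additive (residual) and operates only on frame-level tokens, it can in principle be inserted into a pretrained language-image model without disturbing its original weights, which is the property the abstract emphasizes.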
Pages: 1-18
Number of Pages: 18