Multiscale Vision Transformers

Cited by: 721
Authors
Fan, Haoqi [1]
Xiong, Bo [1]
Mangalam, Karttikeya [1,2]
Li, Yanghao [1]
Yan, Zhicheng [1]
Malik, Jitendra [1,2]
Feichtenhofer, Christoph [1]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of California, Berkeley, Berkeley, CA, USA
Source
2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021) | 2021
DOI
10.1109/ICCV48922.2021.00675
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers.
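The channel-resolution stage schedule described in the abstract can be illustrated with a minimal sketch. The abstract gives no concrete numbers, so everything specific here is an assumption made for illustration: the function name `multiscale_schedule`, the `StageShape` record, the starting width of 96 channels on a 56 x 56 token grid, four stages, and a rule that doubles the channels while halving each spatial side between stages. It only tracks feature-map shapes and is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageShape:
    """Shape of the feature map processed by one scale stage (hypothetical record)."""
    stage: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions (tokens) at this stage.
        return self.height * self.width


def multiscale_schedule(
    base_channels: int,
    height: int,
    width: int,
    num_stages: int,
    channel_mult: int = 2,    # assumed channel expansion per stage transition
    spatial_stride: int = 2,  # assumed downsampling factor per spatial side
) -> List[StageShape]:
    """Return per-stage shapes: channels grow while spatial resolution shrinks."""
    shapes = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        if i > 0:
            # Transition between stages: require a clean division so no
            # rows or columns are silently dropped.
            if h % spatial_stride or w % spatial_stride:
                raise ValueError(
                    f"resolution {h}x{w} is not divisible by {spatial_stride}"
                )
            c *= channel_mult
            h //= spatial_stride
            w //= spatial_stride
        shapes.append(StageShape(i + 1, c, h, w))
    return shapes


if __name__ == "__main__":
    for s in multiscale_schedule(96, 56, 56, num_stages=4):
        print(f"stage {s.stage}: {s.channels} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed settings the early stages hold many tokens with few channels, and the last stage holds few tokens with many channels, which is the pyramid shape the abstract describes.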
Pages: 6804-6815
Page count: 12