Multiscale Vision Transformers

Cited by: 721
Authors
Fan, Haoqi [1]
Xiong, Bo [1]
Mangalam, Karttikeya [1,2]
Li, Yanghao [1]
Yan, Zhicheng [1]
Malik, Jitendra [1,2]
Feichtenhofer, Christoph [1]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Univ Calif Berkeley, Berkeley, CA USA
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
DOI
10.1109/ICCV48922.2021.00675
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating at spatially coarse resolution on complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model to image classification, where it outperforms prior work on vision transformers.
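The stage schedule the abstract describes (each stage doubles the channel width while halving the spatial resolution) can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the 4x patchify stem, the base width of 96 channels, and the four-stage count are assumptions chosen to mirror a typical base-model configuration:

```python
def mvit_stage_schedule(input_res=224, base_channels=96, num_stages=4):
    """Return the per-stage (resolution, channels) pyramid described
    in the abstract: channels expand as spatial resolution shrinks.

    Assumes a 4x downsampling patchify stem before stage 1, and that
    each subsequent stage halves resolution and doubles channel width.
    """
    stages = []
    res, ch = input_res // 4, base_channels
    for i in range(num_stages):
        stages.append({"stage": i + 1, "resolution": res, "channels": ch})
        res //= 2   # spatial resolution shrinks stage by stage
        ch *= 2     # channel capacity expands stage by stage
    return stages

for s in mvit_stage_schedule():
    print(s)
```

With these assumed defaults the pyramid runs from a 56x56 grid at 96 channels down to a 7x7 grid at 768 channels, illustrating how early stages keep fine spatial detail cheap (few channels) while deep stages trade resolution for representational capacity.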
Pages: 6804-6815
Page count: 12