Multiscale Vision Transformers

Cited by: 574
Authors
Fan, Haoqi [1]
Xiong, Bo [1]
Mangalam, Karttikeya [1, 2]
Li, Yanghao [1]
Yan, Zhicheng [1]
Malik, Jitendra [1, 2]
Feichtenhofer, Christoph [1]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of California, Berkeley, Berkeley, CA, USA
Source
2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021) | 2021
DOI
10.1109/ICCV48922.2021.00675
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers.
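The channel-resolution trade-off described in the abstract can be illustrated with a small sketch. The abstract gives no concrete numbers, so everything specific here is an assumption made for illustration: the function name `stage_schedule`, the halving of spatial resolution and doubling of channels at each stage, and the example starting values. It is not the authors' implementation, only a way to see how a multiscale pyramid of feature shapes might evolve across stages.

```python
from typing import List, Tuple


def stage_schedule(
    height: int,
    width: int,
    base_channels: int,
    num_stages: int,
    channel_mult: int = 2,    # assumed: channels double per stage
    spatial_stride: int = 2,  # assumed: each spatial side halves per stage
) -> List[Tuple[int, int, int, int]]:
    """Return (stage, height, width, channels) for each stage.

    Hypothetical helper: early stages keep high spatial resolution with few
    channels; later stages are spatially coarse with many channels.
    """
    schedule = []
    h, w, c = height, width, base_channels
    for stage in range(1, num_stages + 1):
        schedule.append((stage, h, w, c))
        # Move to the next scale: coarser resolution, larger channel capacity.
        h = max(1, h // spatial_stride)
        w = max(1, w // spatial_stride)
        c = c * channel_mult
    return schedule


if __name__ == "__main__":
    # Illustrative starting values, not taken from the abstract.
    for stage, h, w, c in stage_schedule(56, 56, 96, 4):
        print(f"stage {stage}: {h}x{w} tokens, {c} channels")
```

Under these assumed factors, the number of spatial positions shrinks by 4x per stage while the channel dimension grows by 2x, so the early, high-resolution stages stay narrow and the wide stages operate on few positions.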
Pages: 6804-6815
Page count: 12