Multiscale Vision Transformers

Cited by: 574
Authors
Fan, Haoqi [1]
Xiong, Bo [1]
Mangalam, Karttikeya [1, 2]
Li, Yanghao [1]
Yan, Zhicheng [1]
Malik, Jitendra [1, 2]
Feichtenhofer, Christoph [1]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025, USA
[2] University of California, Berkeley, Berkeley, CA, USA
Source
2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021) | 2021
DOI
10.1109/ICCV48922.2021.00675
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers.
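The channel-resolution trade-off described in the abstract can be illustrated with a small arithmetic sketch. The abstract does not give concrete numbers, so everything here is an assumption made for illustration: the function name `stage_schedule`, the starting grid of 56x56 with 96 channels, the four stages, and the factor of 2 applied to both the channel expansion and the spatial reduction at each stage. It is not the authors' implementation.

```python
from typing import List, Tuple


def stage_schedule(
    height: int,
    width: int,
    channels: int,
    num_stages: int,
    channel_factor: int = 2,   # assumed expansion factor per stage
    spatial_stride: int = 2,   # assumed reduction per spatial axis per stage
) -> List[Tuple[int, int, int]]:
    """Return (height, width, channels) for each stage of a hypothetical
    multiscale hierarchy: resolution shrinks while channel width grows."""
    schedule = []
    for _ in range(num_stages):
        schedule.append((height, width, channels))
        # Move to the next stage: coarser grid, wider features.
        height = max(1, height // spatial_stride)
        width = max(1, width // spatial_stride)
        channels *= channel_factor
    return schedule


if __name__ == "__main__":
    # Hypothetical starting point: 56x56 token grid with 96 channels.
    for i, (h, w, c) in enumerate(stage_schedule(56, 56, 96, 4), start=1):
        print(f"stage {i}: {h}x{w} grid, {c} channels")
```

Under these assumed settings, early stages keep a fine grid with few channels and later stages hold a coarse grid with many channels, which is the pyramid shape the abstract describes.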
Pages: 6804-6815
Page count: 12