Multiscale Vision Transformers

Cited by: 721

Authors:

Fan, Haoqi [1]

Xiong, Bo [1]

Mangalam, Karttikeya [1,2]

Li, Yanghao [1]

Yan, Zhicheng [1]

Malik, Jitendra [1,2]

Feichtenhofer, Christoph [1]

Affiliations:

[1] Facebook AI Research, Menlo Park, CA 94025, USA

[2] University of California, Berkeley, Berkeley, CA, USA

Source:

2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021) | 2021

Keywords:

DOI:

10.1109/ICCV48922.2021.00675

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers.
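The channel-resolution schedule described in the abstract can be illustrated with a minimal sketch for the image case (temporal dimension removed). The concrete numbers here are illustrative assumptions, not values stated in this record: a 224-pixel input, an initial patch stride of 4, 96 starting channels, four stages, and a factor of 2 per stage for both channel expansion and spatial downsampling. The names `Stage` and `build_schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One scale stage: channel width and spatial grid of the token map."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial tokens the stage attends over.
        return self.height * self.width


def build_schedule(
    input_size: int = 224,      # assumed input resolution
    patch_stride: int = 4,      # assumed stride of the initial patch embedding
    base_channels: int = 96,    # assumed starting channel dimension
    num_stages: int = 4,        # assumed number of scale stages
) -> List[Stage]:
    """Return a pyramid that widens channels while shrinking resolution."""
    side = input_size // patch_stride
    channels = base_channels
    stages = []
    for i in range(num_stages):
        stages.append(Stage(i + 1, channels, side, side))
        channels *= 2   # expand channel capacity for the next stage
        side //= 2      # halve spatial resolution along each axis
    return stages


schedule = build_schedule()
for s in schedule:
    print(f"stage {s.index}: {s.channels} channels, "
          f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed settings, early stages hold many tokens with few channels and later stages hold few tokens with many channels, which is the multiscale pyramid the abstract describes.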


Pages: 6804-6815

Page count: 12
