Multiscale Vision Transformers

Cited by: 721
Authors
Fan, Haoqi [1]
Xiong, Bo [1]
Mangalam, Karttikeya [1,2]
Li, Yanghao [1]
Yan, Zhicheng [1]
Malik, Jitendra [1,2]
Feichtenhofer, Christoph [1]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Univ Calif Berkeley, Berkeley, CA USA
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
DOI
10.1109/ICCV48922.2021.00675
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating at spatially coarse resolution on complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model to image classification, where it outperforms prior work on vision transformers.
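The stage schedule the abstract describes (each stage doubles the channel width while halving the spatial resolution) can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the 4x patchify stem, the base width of 96 channels, and the four-stage count are assumptions chosen to mirror a typical base-model configuration:

```python
def mvit_stage_schedule(input_res=224, base_channels=96, num_stages=4):
    """Return the per-stage (resolution, channels) pyramid described
    in the abstract: channels expand as spatial resolution shrinks.

    Assumes a 4x downsampling patchify stem before stage 1, and that
    each subsequent stage halves resolution and doubles channel width.
    """
    stages = []
    res, ch = input_res // 4, base_channels
    for i in range(num_stages):
        stages.append({"stage": i + 1, "resolution": res, "channels": ch})
        res //= 2   # spatial resolution shrinks stage by stage
        ch *= 2     # channel capacity expands stage by stage
    return stages

for s in mvit_stage_schedule():
    print(s)
```

With these assumed defaults the pyramid runs from a 56x56 grid at 96 channels down to a 7x7 grid at 768 channels, illustrating how early stages keep fine spatial detail cheap (few channels) while deep stages trade resolution for representational capacity.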
Pages: 6804-6815
Page count: 12