TransMix: Attend to Mix for Vision Transformers

Cited by: 39
Authors
Chen, Jie-Neng [1 ]
Sun, Shuyang [2 ]
He, Ju [1 ]
Torr, Philip [2 ]
Yuille, Alan [1 ]
Bai, Song [3 ]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Univ Oxford, Oxford, England
[3] ByteDance Inc, Beijing, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Funding
UK Engineering and Physical Sciences Research Council
DOI
10.1109/CVPR52688.2022.01182
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Mixup-based augmentation has been found to be effective for generalizing models during training, especially for Vision Transformers (ViTs), since they can easily overfit. However, previous mixup-based methods rely on the prior assumption that the linear interpolation ratio of the targets should match the interpolation ratio of the inputs. This can lead to a strange phenomenon: due to the randomness of the augmentation, the mixed image sometimes contains no valid object, yet the label space still carries a response for it. To bridge this gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers. The confidence assigned to a label is larger if the corresponding input image is weighted higher by the attention map. TransMix is embarrassingly simple and can be implemented in just a few lines of code, without introducing any extra parameters or FLOPs to ViT-based models. Experimental results show that our method consistently improves various ViT-based models at different scales on ImageNet classification. After pre-training with TransMix on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. TransMix also proves more robust when evaluated on 4 different benchmarks.
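The core idea from the abstract can be sketched in a few lines: instead of setting the label-mixing ratio from the CutMix box area, weight it by how much class-token attention falls on each source image's patches. The function name, the flat patch layout, and the NumPy setting are illustrative assumptions, not the paper's implementation (which reads the attention map from the ViT's last block).

```python
import numpy as np

def transmix_lambda(attn, mask):
    """Attention-weighted mixing ratio (a sketch of TransMix's idea).

    attn : (N,) attention scores of the class token over N patches,
           assumed normalized to sum to 1 (softmax output).
    mask : (N,) binary mask; 1 where the patch comes from image A,
           0 where the patch was pasted in from image B (CutMix box).
    Returns lambda, the weight given to image A's label.
    """
    return float((attn * mask).sum())

# Toy example with 4 patches; image B is pasted over the last two.
attn = np.array([0.1, 0.1, 0.4, 0.4])  # attention concentrates on B's region
mask = np.array([1.0, 1.0, 0.0, 0.0])  # first two patches belong to image A
lam = transmix_lambda(attn, mask)      # 0.2, vs. 0.5 from box area alone
# Mixed target: y = lam * y_A + (1 - lam) * y_B
```

Here area-based CutMix would assign A a weight of 0.5, but because the model attends mostly to B's pasted region, TransMix shrinks A's label weight to 0.2, keeping the label consistent with what is actually salient in the mixed image.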
Pages: 12125-12134
Page count: 10
相关论文
共 53 条
[1]  
[Anonymous], 2018, PROC EUR C COMPUT VI, DOI [DOI 10.1007/978-3-030-01234-2_49, 10.1007/978-3-030-01234-2_49]
[2]  
[Anonymous], 2019, ICML
[3]  
Bai Yutong, 2021, NEURIPS
[4]  
Bao Hangbo, 2021, ICLR
[5]   Emerging Properties in Self-Supervised Vision Transformers [J].
Caron, Mathilde ;
Touvron, Hugo ;
Misra, Ishan ;
Jegou, Herve ;
Mairal, Julien ;
Bojanowski, Piotr ;
Joulin, Armand .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :9630-9640
[6]  
Chen Chun-Fu, 2021, ICCV
[7]  
Chen Jieneng, 2021, ARXIV210204306
[8]  
Chen K., 2019, CoRR abs/1906.07155
[9]   DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [J].
Chen, Liang-Chieh ;
Papandreou, George ;
Kokkinos, Iasonas ;
Murphy, Kevin ;
Yuille, Alan L. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (04) :834-848
[10]   Randaugment: Practical automated data augmentation with a reduced search space [J].
Cubuk, Ekin D. ;
Zoph, Barret ;
Shlens, Jonathon ;
Le, Quoc, V .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, :3008-3017