Locality Guidance for Improving Vision Transformers on Tiny Datasets

Cited by: 28
Authors
Li, Kehan [1]
Yu, Runyi [1]
Wang, Zhennan [2]
Yuan, Li [1,2]
Song, Guoli [2]
Chen, Jie [1,2]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
Source
COMPUTER VISION, ECCV 2022, PT XXIV | 2022 / Vol. 13684
Keywords
DOI
10.1007/978-3-031-20053-3_7
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
While the Vision Transformer (VT) architecture is becoming trendy in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first analyze that local information, which is of great importance for understanding images, is difficult to learn with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate the learning of local information, we realize locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is sufficient to accelerate convergence and substantially improve the performance of VTs. Our locality guidance approach is therefore simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method significantly improves VTs trained from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our proposed method boosts the performance of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T and 7.85% for PVT), and enhances the even stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.
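The dual-task paradigm described above amounts to a standard classification loss plus a feature-imitation loss that pulls the VT's patch tokens toward the frozen, pre-trained CNN's feature maps. Below is a minimal PyTorch sketch of such an objective; the names (LocalityGuidanceLoss, proj, lambda_guidance) and the specific choice of a 1x1 projection with an MSE imitation term are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalityGuidanceLoss(nn.Module):
    """Hypothetical dual-task objective: classification + CNN feature imitation."""

    def __init__(self, vt_dim: int, cnn_dim: int, lambda_guidance: float = 1.0):
        super().__init__()
        # 1x1 projection to align VT token channels with CNN feature channels
        self.proj = nn.Conv2d(vt_dim, cnn_dim, kernel_size=1)
        self.lambda_guidance = lambda_guidance

    def forward(self, logits, labels, vt_tokens, cnn_feat):
        # vt_tokens: (B, N, C) patch tokens; reshape into a 2D feature map
        B, N, C = vt_tokens.shape
        H = W = int(N ** 0.5)
        vt_map = vt_tokens.transpose(1, 2).reshape(B, C, H, W)
        vt_map = self.proj(vt_map)
        # Resize the CNN features (computed on low-resolution inputs) to match
        cnn_feat = F.interpolate(cnn_feat, size=(H, W), mode="bilinear",
                                 align_corners=False)
        cls_loss = F.cross_entropy(logits, labels)          # task 1: classification
        mimic_loss = F.mse_loss(vt_map, cnn_feat.detach())  # task 2: imitate CNN features
        return cls_loss + self.lambda_guidance * mimic_loss

In training, one forward pass through the VT would yield logits and intermediate patch tokens, and a forward pass through the frozen lightweight CNN (on a downsampled copy of the image) would yield cnn_feat; the combined loss is then backpropagated through the VT and the projection only.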
Pages: 110-127
Number of pages: 18