Add-Vit: CNN-Transformer Hybrid Architecture for Small Data Paradigm Processing

Cited by: 3
Authors
Chen, Jinhui [1 ]
Wu, Peng [1 ]
Zhang, Xiaoming [2 ]
Xu, Renjie [3 ]
Liang, Jia [1 ]
Affiliations
[1] Zhejiang Sci Tech Univ, Sch Mech Engn, 928 2nd St, Hangzhou 310018, Zhejiang, Peoples R China
[2] Army Acad Armored Forces, Dept Vehicle Engn, Dujiakan 21st, Fengtai, Beijing 100072, Peoples R China
[3] Army Acad Armored Forces, Performance & Training Ctr, Dujiakan 21st, Fengtai, Beijing 100072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Local feature; Vision transformer (ViT); Image classification; Small data paradigm
DOI
10.1007/s11063-024-11643-8
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The vision transformer (ViT), pre-trained on large datasets, outperforms convolutional neural networks (CNN) in computer vision (CV). However, without pre-training, the transformer architecture does not perform well on small datasets and is surpassed by CNNs. Through analysis, we found that: (1) the division and processing of tokens in the ViT discard the marginal information between tokens; (2) the isolated multi-head self-attention (MSA) lacks prior knowledge; (3) the local inductive bias of stacked transformer blocks is much inferior to that of CNNs. We propose a novel architecture for small data paradigms without pre-training, named Add-Vit, which uses progressive tokenization with feature supplementation in patch embedding. The model's representational ability is enhanced by a convolutional prediction module shortcut connected to the MSA, capturing local features as additional representations of the tokens. Without pre-training on large datasets, our best model achieved 81.25% accuracy when trained from scratch on CIFAR-100.
Pages: 17
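The abstract describes attaching a convolutional prediction module as a shortcut alongside the MSA so that local features supplement each token. Below is a minimal PyTorch-style sketch of that idea; the class name, the choice of a depthwise 3x3 convolution, the layer sizes, and the way the shortcut is summed into the residual stream are assumptions inferred from the abstract, not the authors' released implementation.

import torch
import torch.nn as nn

class ConvShortcutBlock(nn.Module):
    """Hypothetical transformer block with a parallel convolutional shortcut.

    Multi-head self-attention runs on the token sequence, while a depthwise
    convolution over the 2D token grid supplies local features that are added
    to the attention output (a sketch of the abstract's idea, not the paper's code).
    """

    def __init__(self, dim: int, num_heads: int, grid_size: int):
        super().__init__()
        self.grid_size = grid_size          # tokens form a grid_size x grid_size map
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Convolutional shortcut: captures local inductive bias per channel.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), with num_tokens == grid_size ** 2
        b, n, d = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Reshape tokens to a 2D map so the convolution sees spatial neighbours.
        grid = h.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        local_out = self.local(grid).flatten(2).transpose(1, 2)
        # MSA output plus local features as additional token representations.
        x = x + attn_out + local_out
        x = x + self.mlp(self.norm2(x))
        return x

if __name__ == "__main__":
    block = ConvShortcutBlock(dim=192, num_heads=3, grid_size=8)
    tokens = torch.randn(2, 64, 192)    # 8 x 8 = 64 tokens
    print(block(tokens).shape)          # torch.Size([2, 64, 192])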