Add-Vit: CNN-Transformer Hybrid Architecture for Small Data Paradigm Processing

Cited by: 3
Authors
Chen, Jinhui [1 ]
Wu, Peng [1 ]
Zhang, Xiaoming [2 ]
Xu, Renjie [3 ]
Liang, Jia [1 ]
Affiliations
[1] Zhejiang Sci Tech Univ, Sch Mech Engn, 928 2nd St, Hangzhou 310018, Zhejiang, Peoples R China
[2] Army Acad Armored Forces, Dept Vehicle Engn, Dujiakan 21st, Fengtai, Beijing 100072, Peoples R China
[3] Army Acad Armored Forces, Performance & Training Ctr, Dujiakan 21st, Fengtai, Beijing 100072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Local feature; Vision transformer (ViT); Image classification; Small data paradigm
DOI
10.1007/s11063-024-11643-8
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The vision transformer (ViT), pre-trained on large datasets, outperforms convolutional neural networks (CNN) in computer vision (CV). However, without pre-training, the transformer architecture does not perform well on small datasets and is surpassed by CNNs. Through analysis, we found that: (1) the division and processing of tokens in the ViT discard the marginal information between tokens; (2) the isolated multi-head self-attention (MSA) lacks prior knowledge; (3) the local inductive bias of stacked transformer blocks is much inferior to that of CNNs. We propose a novel architecture for small data paradigms without pre-training, named Add-Vit, which uses progressive tokenization with feature supplementation in patch embedding. The model's representational ability is enhanced by a convolutional prediction module shortcut connected to the MSA, capturing local features as additional representations of the tokens. Without pre-training on large datasets, our best model achieved 81.25% accuracy when trained from scratch on CIFAR-100.
Pages: 17
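The abstract describes attaching a convolutional prediction module as a shortcut alongside the MSA so that local features supplement each token. Below is a minimal PyTorch-style sketch of that idea; the class name, the choice of a depthwise 3x3 convolution, the layer sizes, and the way the shortcut is summed into the residual stream are assumptions inferred from the abstract, not the authors' released implementation.

import torch
import torch.nn as nn

class ConvShortcutBlock(nn.Module):
    """Hypothetical transformer block with a parallel convolutional shortcut.

    Multi-head self-attention runs on the token sequence, while a depthwise
    convolution over the 2D token grid supplies local features that are added
    to the attention output (a sketch of the abstract's idea, not the paper's code).
    """

    def __init__(self, dim: int, num_heads: int, grid_size: int):
        super().__init__()
        self.grid_size = grid_size          # tokens form a grid_size x grid_size map
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Convolutional shortcut: captures local inductive bias per channel.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), with num_tokens == grid_size ** 2
        b, n, d = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Reshape tokens to a 2D map so the convolution sees spatial neighbours.
        grid = h.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        local_out = self.local(grid).flatten(2).transpose(1, 2)
        # MSA output plus local features as additional token representations.
        x = x + attn_out + local_out
        x = x + self.mlp(self.norm2(x))
        return x

if __name__ == "__main__":
    block = ConvShortcutBlock(dim=192, num_heads=3, grid_size=8)
    tokens = torch.randn(2, 64, 192)    # 8 x 8 = 64 tokens
    print(block(tokens).shape)          # torch.Size([2, 64, 192])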