TransMCGC: a recast vision transformer for small-scale image classification tasks

Cited by: 0
Authors
Jian-Wen Xiang
Min-Rong Chen
Pei-Shan Li
Hao-Li Zou
Shi-Da Li
Jun-Jie Huang
Affiliations
[1] South China Normal University,School of Computer Science
[2] Jinan University,College of Cyber Security and the National Joint Engineering Research Center of Network Security Detection and Protection Technology
Source
Neural Computing and Applications | 2023 / Vol. 35
Keywords
Vision transformer; Convolution; Multi-head self-attention; Stage
DOI: not available
Abstract
A multi-stage hierarchical structure is a basic and effective design pattern in convolutional neural networks (CNNs). Recently, Vision Transformers (ViTs) have achieved impressive performance as a new architecture for various vision tasks. However, many properties of ViTs remain unexplored. In this paper, we empirically find that, despite lacking the explicit multi-stage hierarchical design of CNNs, ViT models automatically organize their layers into stages (or block groups) that gradually extract different levels of feature information. Moreover, ViT models concentrate highly similar Transformer blocks in the last stage, where multi-head self-attention becomes less effective at learning useful concepts for feature learning and may therefore limit the expected performance gain. To this end, we recast ViT into a new framework, named TransMCGC, replacing the inefficient Transformer blocks in the last stage of the Vision Transformer with the proposed convolution-based MCGC blocks. The MCGC block consists of two parallel sub-modules: a Multi-branch Convolution module that integrates local neighborhood features and multi-scale context information, and a Global Context module that captures global dependencies with negligible parameters. In this way, the proposed MCGC block collaboratively integrates convolutional locality and global dependencies to enhance the feature-learning ability of the model. Finally, extensive experiments on six standard small-scale benchmark datasets, including CIFAR10, CIFAR100, Stanford Cars, Oxford 102 Flowers, DTD and Food101, demonstrate the effectiveness of the proposed MCGC block and show that our TransMCGC models outperform the ViT baseline while achieving competitive performance compared with state-of-the-art ViT variants.
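The abstract describes the MCGC block as two parallel paths: a multi-branch local convolution and a lightweight global-context path. A minimal single-channel NumPy sketch of that parallel structure is given below; the function names, kernel choices, and softmax attention pooling are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D convolution of a single-channel map x with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def mcgc_block(x, kernels, w_ctx):
    """Toy single-channel MCGC-style block (hypothetical sketch):
    multi-branch local convolutions in parallel with a global-context path."""
    # Multi-branch Convolution module: branches with different kernel sizes, summed
    local = sum(conv2d_same(x, k) for k in kernels)
    # Global Context module: softmax attention pooling over spatial positions,
    # yielding one scalar descriptor (hence 'negligible parameters')
    attn = np.exp(x - x.max())
    attn /= attn.sum()
    ctx = float((attn * x).sum())
    # Fuse the two parallel paths: broadcast global context onto local features
    return local + w_ctx * ctx

# Example: a uniform 4x4 map with a 1x1 branch and a 3x3 averaging branch
x = np.ones((4, 4))
kernels = [np.ones((1, 1)), np.ones((3, 3)) / 9.0]
out = mcgc_block(x, kernels, w_ctx=0.5)
```

In a real multi-channel implementation the branches would be learned convolutions of different receptive fields and the global descriptor would pass through a small channel transform before fusion; the sketch only shows the parallel local/global topology.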
Pages: 7697-7718 (21 pages)