HAT: A Visual Transformer Model for Image Recognition Based on Hierarchical Attention Transformation

Cited by: 0

Authors:

Zhao, Xuanyu [1]

Hu, Tao [1,2,3]

Mao, Chunxia [1]

Yuan, Ye [4]

Li, Jun [1]

Affiliations:

[1] Hubei Minzu Univ, Coll Intelligent Syst Sci & Engn, Enshi 445000, Peoples R China

[2] Hubei Minzu Univ, Hubei Engn Res Ctr Selenium Food Nutr & Hlth Intel, Enshi 445000, Peoples R China

[3] Minist Culture & Tourism, Key Lab Performing Art Equipment & Syst Technol, Beijing 100007, Peoples R China

[4] Enshi Audit Off, Dept Gastroenterol, Enshi 445000, Peoples R China

Source:

IEEE Access | 2023 / Volume 11

Funding:

National Natural Science Foundation of China;

Keywords:

Visual transformer; attention transfer mechanism; hierarchical network; image feature; image recognition;

DOI:

10.1109/ACCESS.2023.3314573

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In the field of image recognition, the Visual Transformer (ViT) performs excellently. However, ViT relies on a fixed self-attention layer, which tends to cause computational redundancy and makes it difficult to maintain the integrity of the image convolutional feature sequence during training. We therefore propose a non-normalization hierarchical attention transfer network (HAT), which introduces a threshold attention mechanism and a multi-head attention mechanism after pooling in each layer. The focus of HAT shifts between local and global, thus flexibly controlling the attention range for image classification. HAT has lower computational complexity, which improves its scalability and enables it to handle longer feature sequences while balancing efficiency and accuracy. HAT removes layer normalization to increase the likelihood of convergence to an optimal level during training. To verify the effectiveness of the proposed model, we conducted experiments on image classification and segmentation tasks. The results show that, compared with classical pyramid-structured networks and different attention networks, HAT outperforms the benchmark networks on both the ImageNet and CIFAR100 datasets.
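Illustrative sketch (not from the paper): the abstract describes each layer as pooling followed by threshold attention and multi-head attention, with layer normalization removed. The record does not define these components, so the code below is only one plausible reading under stated assumptions: average pooling over the token sequence, a threshold `tau` that zeroes small softmax weights and renormalizes the rest, heads formed by slicing the feature dimension without learned projections, and a plain residual connection. All function names and parameters (`avg_pool`, `threshold_attention`, `multi_head`, `hat_like_stage`, `stride`, `tau`) are hypothetical.

```python
import math

def avg_pool(tokens, stride):
    """Average consecutive groups of `stride` tokens (assumed pooling choice)."""
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        group = tokens[start:start + stride]
        pooled.append([sum(col) / stride for col in zip(*group)])
    return pooled

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_attention(queries, keys, values, tau):
    """Scaled dot-product attention; weights below `tau` are dropped (assumption)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        kept = [w if w >= tau else 0.0 for w in weights]
        if sum(kept) == 0.0:
            # Threshold removed every key: fall back to the strongest one(s).
            top = max(weights)
            kept = [1.0 if w == top else 0.0 for w in weights]
        norm = sum(kept)
        out.append([sum(w * v[d] for w, v in zip(kept, values)) / norm
                    for d in range(len(values[0]))])
    return out

def multi_head(tokens, num_heads, tau):
    """Slice the feature dimension into heads, attend per head, concatenate."""
    head_dim = len(tokens[0]) // num_heads
    heads = []
    for h in range(num_heads):
        part = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        heads.append(threshold_attention(part, part, part, tau))
    return [sum((head[i] for head in heads), []) for i in range(len(tokens))]

def hat_like_stage(tokens, stride=2, num_heads=2, tau=0.1):
    """One stage: pool, attend, add a residual; no layer normalization is applied."""
    pooled = avg_pool(tokens, stride)
    attended = multi_head(pooled, num_heads, tau)
    return [[p + a for p, a in zip(p_row, a_row)]
            for p_row, a_row in zip(pooled, attended)]
```

In this sketch, raising `tau` narrows each token's attention toward its strongest matches (more local), while `tau = 0` keeps every key (fully global), which is one way to picture the local/global shift the abstract mentions. Stacking stages halves the sequence length each time when `stride=2`, giving a pyramid-like hierarchy.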

Pages: 100042-100051

Page count: 10
