HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Cited by: 2

Authors:

Guo, Dongen [1]

Wu, Zechen [1]

Feng, Jiangfan [2]

Zhou, Zhuoke [1]

Shen, Zhen [1]

Affiliations:

[1] Nanyang Inst Technol, Sch Comp & Software, 80 Changjiang Rd, Nanyang 473004, Henan, Peoples R China

[2] Chongqing Univ Posts & Telecommun, Chongqing Engn Res Ctr Spatial Big Data Intelligen, 2 Chongwen Rd, Chongqing 400065, Peoples R China

Source:

APPLIED INTELLIGENCE | 2023 / Vol. 53 / Issue 21

Funding:

National Natural Science Foundation of China;

Keywords:

Remote sensing image; Scene classification; Highly efficient lightweight model; Adaptive token merging; Fast multi-head self attention; Vision transformer (ViT);

DOI:

10.1007/s10489-023-04725-y

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies and thus acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To effectively leverage those similar features and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining a CNN and a Transformer, and it consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv(*)), and Fast Multi-Head Self Attention (F-MHSA) is used to reduce the complexity of the self-attention mechanism from quadratic to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model outperforms state-of-the-art remote sensing scene classification methods in both accuracy and efficiency. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6 GMACs for 10%/20% training samples, respectively.
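
The abstract's two cost-saving ideas (linear instead of quadratic attention, and fewer tokens through merging) can be illustrated with a rough operation count. The sketch below is not the paper's F-MHSA or ATOME formulation, which the abstract does not specify. It assumes a generic single-head attention and a generic kernelized linear attention, counts only the two large matrix products in each, and uses hypothetical token counts and channel width.

```python
def softmax_attention_macs(num_tokens: int, dim: int) -> int:
    """MACs for Q @ K^T plus attn @ V in standard self-attention (one head).

    Both products cost num_tokens * num_tokens * dim multiply-accumulates,
    so the total grows quadratically with the number of tokens.
    """
    return 2 * num_tokens * num_tokens * dim


def linear_attention_macs(num_tokens: int, dim: int) -> int:
    """MACs for K^T @ V plus Q @ (K^T V) in a generic kernelized linear attention.

    Both products cost num_tokens * dim * dim multiply-accumulates,
    so the total grows linearly with the number of tokens.
    """
    return 2 * num_tokens * dim * dim


if __name__ == "__main__":
    dim = 64  # hypothetical per-head channel width (assumption)
    for num_tokens in (196, 784, 3136):  # hypothetical token counts (assumption)
        quad = softmax_attention_macs(num_tokens, dim)
        lin = linear_attention_macs(num_tokens, dim)
        print(f"{num_tokens:5d} tokens: quadratic={quad:,}  linear={lin:,}")
```

Under these assumptions, doubling the token count quadruples the quadratic cost but only doubles the linear cost, and halving the token count (as a token-merging step might) halves the linear cost. Projections, normalization, and convolutional layers are ignored, so the numbers do not correspond to the GMACs reported above.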


Pages: 24947-24962

Page count: 16

