HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Times cited: 2
Authors
Guo, Dongen [1]
Wu, Zechen [1]
Feng, Jiangfan [2]
Zhou, Zhuoke [1]
Shen, Zhen [1]
Affiliations
[1] Nanyang Inst Technol, Sch Comp & Software, 80 Changjiang Rd, Nanyang 473004, Henan, Peoples R China
[2] Chongqing Univ Posts & Telecommun, Chongqing Engn Res Ctr Spatial Big Data Intelligen, 2 Chongwen Rd, Chongqing 400065, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Remote sensing image; Scene classification; Highly efficient lightweight model; Adaptive token merging; Fast multi-head self attention; Vision transformer (ViT)
DOI
10.1007/s10489-023-04725-y
Chinese Library Classification number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies and thereby acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To effectively leverage those similar features and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining a CNN and a Transformer, and it consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv(*)), and Fast Multi-Head Self Attention (F-MHSA) is used to reduce the complexity of the self-attention mechanism from quadratic to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model outperforms state-of-the-art remote sensing scene classification methods in both accuracy and efficiency. On the most challenging NWPU dataset, HELViT achieves the highest accuracy, 94.64% and 96.84% with 10% and 20% training samples, respectively, at 4.6 GMACs.
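The idea of fusing similar tokens to cut computation can be illustrated with a minimal sketch. This is not the authors' ATOME: the abstract does not describe its adaptive criterion, so the sketch below assumes a fixed merge budget `r`, an alternating split of the tokens into two sets, cosine similarity as the matching score, and plain averaging as the fusion rule. The name `merge_tokens` and all parameters are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Merge up to r tokens into their most similar partner (illustrative only).

    tokens: list of feature vectors (lists of numbers).
    r: assumed fixed merge budget; an adaptive scheme would choose this per input.
    """
    # Split the tokens alternately into set A (even positions) and set B (odd positions).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # For every token in A, find its most similar token in B.
    edges = []
    for i in a_idx:
        score, j = max((cosine(tokens[i], tokens[j]), j) for j in b_idx)
        edges.append((score, i, j))

    # Keep only the r strongest A -> B matches.
    edges.sort(reverse=True)
    kept = edges[:r]
    merged_a = {i for _, i, _ in kept}

    # Fuse each matched A token into its B partner by averaging.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in kept:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Rebuild the sequence in original order, skipping merged A tokens.
    out = []
    for k in range(len(tokens)):
        if k in merged_a:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


if __name__ == "__main__":
    tokens = [[1, 0], [1, 0], [0, 1], [1, 1]]
    print(len(tokens), "->", len(merge_tokens(tokens, r=1)))
```

Each merge shortens the token sequence by one, so later attention layers operate on fewer tokens. How the number of merges is adapted to the image content is specific to the paper and is not reproduced here.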
Pages: 24947-24962
Page count: 16