HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Cited by: 0
Authors
Dongen Guo
Zechen Wu
Jiangfan Feng
Zhuoke Zhou
Zhen Shen
Affiliations
[1] Nanyang Institute of Technology, School of Computer and Software
[2] Chongqing University of Posts and Telecommunications, Chongqing Engineering Research Center for Spatial Big Data Intelligent Technology
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Remote sensing image; Scene classification; Highly efficient lightweight model; Adaptive token merging; Fast multi-head self attention; Vision transformer (ViT)
DOI
Not available
Abstract
Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies and thus global information, but it is computationally intensive. In addition, scenes of the same class in remote sensing images contain a large number of similar background or foreground features. To effectively leverage these similar features and reduce the computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer, and consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv*), and Fast Multi-Head Self-Attention (F-MHSA) is used to reduce the quadratic complexity of the self-attention mechanism to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID, and NWPU datasets show that the proposed model achieves better accuracy and efficiency than state-of-the-art remote sensing scene classification methods. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6 GMACs for 10%/20% training samples, respectively.
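The abstract names two efficiency mechanisms without giving their formulations. As a rough illustration only, the PyTorch sketch below follows the common linearized-attention pattern (softmax over keys, contracting k^T v before multiplying by q) that matches the linear-complexity behavior claimed for F-MHSA; the class name LinearSelfAttention and every implementation detail are assumptions, not the authors' code.

import torch
import torch.nn as nn

class LinearSelfAttention(nn.Module):
    # Hypothetical sketch, not the authors' F-MHSA. Softmax is applied to
    # queries over the feature dim and to keys over the token dim, and
    # k^T v is contracted first, so cost is O(N*d^2) instead of O(N^2*d).
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        q = q.softmax(dim=-1)                  # normalize over features
        k = k.softmax(dim=-2)                  # normalize over tokens
        context = k.transpose(-2, -1) @ v      # (B, heads, d, d), linear in N
        out = (q @ context).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Likewise, a minimal ToMe-style bipartite merging step can illustrate how adaptive token merging fuses the most similar token pairs; the function merge_similar_tokens and the fixed merge count r are hypothetical stand-ins for ATOME, whose adaptive rule is not described in the abstract.

import torch.nn.functional as F

def merge_similar_tokens(x, r):
    # Hypothetical merging sketch (requires PyTorch >= 1.12 for scatter_reduce):
    # split tokens into two sets, match each even token to its most similar
    # odd token by cosine similarity, and average the r best pairs (N -> N - r).
    B, N, C = x.shape
    a, b = x[:, ::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-2, -1)
    best_sim, best_dst = sim.max(dim=-1)        # best partner in b per a-token
    order = best_sim.argsort(dim=-1, descending=True)
    merged, kept = order[:, :r], order[:, r:]   # a-tokens to merge / to keep
    dst = best_dst.gather(1, merged)            # target indices in b
    src = a.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, C), src,
                         reduce="mean", include_self=True)
    a_kept = a.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([a_kept, b], dim=1)        # (B, N - r, C)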
Pages: 24947-24962
Page count: 15