A hybrid CNN-vision transformer structure for remote sensing scene classification

Cited by: 1
Authors
Li, Nan [1 ]
Hao, Siyuan [2 ]
Zhao, Kun [1 ]
Affiliations
[1] Qingdao Univ Technol, Sch Informat & Control Engn, Qingdao, Shandong, Peoples R China
[2] Beijing Jiaotong Univ, Sch Software Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Remote sensing image; scene classification; Swin Transformer; convolutional neural network; fusion; model
DOI
10.1080/2150704X.2024.2302348
Chinese Library Classification
TP7 [Remote Sensing Technology]
Subject Classification Codes
081102; 0816; 081602; 083002; 1404
Abstract
Vision Transformers (ViTs), built on the self-attention mechanism, have become one of the main architectures in deep learning and are emerging as an alternative to Convolutional Neural Networks (CNNs) for remote sensing scene classification. However, the earlier self-attention layers of ViTs attend mainly to local rather than global features, while the deeper self-attention layers capture global features but ignore the differing contributions of different frequency components; moreover, the quadratic complexity of self-attention over long token sequences greatly increases training and computational cost. In this paper, we propose a hybrid CNN-vision transformer structure (HCVNet), which replaces the earlier self-attention layers with convolutional layers and the deeper self-attention layers with a novel Frequency Multi-head Self-Attention (F-MSA) mechanism. Specifically, F-MSA is a dual-stream structure that encodes high- and low-frequency information separately, reducing computational cost while improving classification performance. In addition, a Semantic-aware Localization (SaL) module is introduced to guide crop selection by learning prior knowledge, avoiding pure-background sampling. Our method achieves an accuracy of 97.20 ± 0.02% on the Aerial Image Dataset and 93.89 ± 0.03% on the NWPU-RESISC45 dataset, with low computational cost.
Pages: 88-98
Number of pages: 11
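
To make the dual-stream F-MSA idea described in the abstract concrete, below is a minimal PyTorch sketch of a block that routes part of the channels through a local depthwise convolution (a high-frequency stream) and the rest through self-attention over an average-pooled token sequence (a low-frequency stream). All names here (DualFrequencyBlock, high_ratio, pool) are illustrative assumptions, not the authors' code; the actual F-MSA and HCVNet designs in the paper may differ in detail.

# Hedged sketch of a dual-stream high/low-frequency attention block.
# Illustrative only; not the authors' F-MSA implementation.
import torch
import torch.nn as nn


class DualFrequencyBlock(nn.Module):
    """Splits channels into a high-frequency (local, convolutional) stream
    and a low-frequency (global, attention-over-pooled-tokens) stream."""

    def __init__(self, dim=96, num_heads=4, pool=2, high_ratio=0.5):
        super().__init__()
        self.high_dim = int(dim * high_ratio)   # channels for the local stream
        self.low_dim = dim - self.high_dim      # channels for the global stream

        # High-frequency stream: a cheap depthwise conv keeps local detail.
        self.local = nn.Conv2d(self.high_dim, self.high_dim, kernel_size=3,
                               padding=1, groups=self.high_dim)

        # Low-frequency stream: average-pool before attention, shortening the
        # token sequence and thus cutting the quadratic attention cost.
        self.pool = nn.AvgPool2d(pool)
        self.attn = nn.MultiheadAttention(self.low_dim, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool, mode="nearest")

        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # fuse the two streams

    def forward(self, x):                        # x: (B, dim, H, W)
        xh, xl = torch.split(x, [self.high_dim, self.low_dim], dim=1)

        xh = self.local(xh)                      # local / high-frequency features

        b, c, h, w = xl.shape
        t = self.pool(xl)                        # (B, c, H/pool, W/pool)
        ph, pw = t.shape[-2:]
        t = t.flatten(2).transpose(1, 2)         # (B, N', c) token sequence
        t, _ = self.attn(t, t, t)                # global self-attention
        xl = self.up(t.transpose(1, 2).reshape(b, c, ph, pw))

        return self.proj(torch.cat([xh, xl], dim=1))


if __name__ == "__main__":
    block = DualFrequencyBlock(dim=96)
    feats = torch.randn(2, 96, 28, 28)           # e.g. a mid-stage feature map
    print(block(feats).shape)                     # torch.Size([2, 96, 28, 28])

Pooling before attention is what keeps the global stream sub-quadratic in the original sequence length; the concatenation followed by a 1x1 convolution is one simple way to fuse the two streams and is only a stand-in for whatever fusion the paper actually uses.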