P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification

Cited by: 21
Authors
Wang, Guanqun [1,2]
Chen, He [1,2]
Chen, Liang [1,2]
Zhuang, Yin [1,2]
Zhang, Shanghang [3]
Zhang, Tong [1,2]
Dong, Hao [3]
Gao, Peng [4]
Affiliations
[1] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China
[2] Beijing Key Lab Embedded Real Time Informat Proc, Beijing 100081, Peoples R China
[3] Peking Univ, Sch Comp Sci, Beijing 100871, Peoples R China
[4] Shanghai AI Lab, Shanghai 200232, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
remote sensing image classification; vision transformer; plug-and-play; feature embedded; SCENE CLASSIFICATION; SATELLITE; NETWORK; COVER;
DOI
10.3390/rs15071773
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, assigning a label to each acquired remote sensing image. Thanks to the strong global context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, capturing global spatial information alone is insufficient for strong RSIC performance. Specifically, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, because ViT lacks inductive biases, its powerful global spatial context representation requires lengthy training and large-scale pre-training data. To address these problems, a hybrid architecture of a convolutional neural network (CNN) and ViT, called P2FEViT, is proposed to improve RSIC; it integrates plug-and-play CNN features with ViT. In this paper, the feature representation capabilities of CNN and ViT as applied to RSIC are first analyzed. Second, to combine the advantages of CNN and ViT, a novel approach that embeds CNN features into the ViT architecture is proposed, which enables the model to simultaneously capture and fuse global context and local multimodal information, further improving the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges quickly and smoothly with relatively less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and on a self-built fine-grained target classification dataset called BIT-AFGR50. The experimental results demonstrate that the proposed P2FEViT effectively improves feature description capability and obtains outstanding image classification performance, while significantly reducing ViT's dependence on large-scale pre-training data and accelerating convergence. The code and self-built dataset will be released on the authors' webpages.
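The abstract describes embedding CNN features into a ViT and training with a plain cross-entropy loss, but it does not specify the fusion mechanism. The NumPy sketch below is only an illustration of the general idea under stated assumptions: the CNN feature map is a random stand-in for a real backbone output, fusion is assumed to be element-wise addition of projected CNN tokens to patch tokens, and all names (`patchify`, `hybrid_forward`, `init_params`) and sizes are hypothetical rather than taken from the paper. The 45 output classes match the number of scene classes in NWPU-RESISC45.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def cross_entropy(logits, label):
    """Cross-entropy of one sample via a stable log-sum-exp."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def init_params(patch=16, channels=3, cnn_dim=32, dim=64, num_classes=45):
    """Random weights for the toy model (hypothetical sizes)."""
    def w(i, o):
        return rng.normal(0.0, 0.02, size=(i, o))
    return {
        "w_patch": w(patch * patch * channels, dim),  # ViT patch embedding
        "w_cnn": w(cnn_dim, dim),                     # projects CNN features
        "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
        "w_head": w(dim, num_classes),
    }

def hybrid_forward(img, cnn_feat, params, patch=16):
    """Fuse CNN feature tokens into ViT patch tokens, then classify."""
    vit_tokens = patchify(img, patch) @ params["w_patch"]            # (N, D)
    cnn_tokens = cnn_feat.reshape(-1, cnn_feat.shape[-1]) @ params["w_cnn"]
    # ASSUMPTION: fusion by element-wise addition; the paper's actual
    # embedding scheme is not described in the abstract.
    tokens = vit_tokens + cnn_tokens
    tokens = tokens + self_attention(
        tokens, params["wq"], params["wk"], params["wv"])  # residual
    return tokens.mean(axis=0) @ params["w_head"]          # class logits

# Toy usage: a 64x64 RGB image gives a 4x4 grid of 16x16 patches.
params = init_params()
img = rng.random((64, 64, 3))
cnn_feat = rng.random((4, 4, 32))  # stand-in for a CNN backbone output
logits = hybrid_forward(img, cnn_feat, params)
loss = cross_entropy(logits, label=7)
```

The CNN feature map is given the same 4x4 spatial grid as the patch tokens so the two token sets can be added directly; a real backbone would need its output resampled or projected to match.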
Pages: 26