Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification

Cited by: 81
Authors
Xu, Kejie [1]
Deng, Peifang [1]
Huang, Hong [1, 2]
Affiliations
[1] Chongqing Univ, Key Lab Optoelect Technol & Syst, Educ Minist China, Chongqing 400044, Peoples R China
[2] Chongqing Univ, State Key Lab Coal Mine Disaster Dynam & Control, Chongqing 400044, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2022 / Vol. 60
Funding
National Natural Science Foundation of China
Keywords
Remote sensing; Feature extraction; Transformers; Context modeling; Computational modeling; Semantics; Layout; Convolutional neural network (CNN); high spatial resolution (HSR) images; knowledge distillation (KD); scene classification; vision transformer (ViT); CONVOLUTIONAL NEURAL-NETWORKS; ATTENTION; REPRESENTATION; SEGMENTATION; ENCODER; FUSION;
DOI
10.1109/TGRS.2022.3152566
Chinese Library Classification (CLC) number
P3 [Geophysics]; P59 [Geochemistry]
Discipline classification codes
0708; 070902
Abstract
Scene classification is an active research topic in the remote sensing community, and complex spatial layouts containing various types of objects make classification highly challenging. Convolutional neural network (CNN)-based methods attempt to capture global features by gradually expanding the receptive field, but they ignore long-range contextual information. Vision transformer (ViT) can extract contextual features, but its ability to learn local information is limited and its computational complexity is high. In this article, an end-to-end method is proposed that employs ViT as an excellent teacher for guiding small networks (ET-GSNet) in remote sensing image scene classification. In ET-GSNet, ResNet18 is selected as the student model, which integrates the strengths of the two models via knowledge distillation (KD) without increasing the computational complexity. In the KD process, the ViT and ResNet18 are optimized together without independent pretraining; the learning rate of the teacher model gradually decreases to zero, while the weight coefficient of the KD loss module is doubled. With these procedures, dark knowledge from the teacher model can be transferred to the student model more smoothly. Experimental results on four public remote sensing datasets demonstrate that the proposed ET-GSNet achieves superior classification performance compared with some state-of-the-art (SOTA) methods. In addition, we evaluate ET-GSNet on a fine-grained ship recognition dataset, and the results show that our method generalizes well to different tasks in terms of some metrics.
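The KD schedule described in the abstract (teacher learning rate decaying to zero while the KD loss weight doubles) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the schedule shapes, so the linear decay, the linear ramp, the temperature-softened KL form of the KD loss, and every function and parameter name here (`teacher_lr`, `kd_weight`, `student_loss`, `temperature`, `base_weight`) are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the standard soft-target KD loss; the abstract does not
    state which form the paper uses.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def cross_entropy(logits, label):
    """Cross-entropy of raw logits against an integer class label."""
    return -log_softmax(logits)[label]


def teacher_lr(step, total_steps, base_lr):
    """Assumed linear decay of the teacher learning rate from base_lr to 0."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


def kd_weight(step, total_steps, base_weight):
    """Assumed linear ramp of the KD weight from base_weight to 2 * base_weight."""
    return base_weight * (1.0 + min(1.0, step / total_steps))


def student_loss(student_logits, teacher_logits, label, step, total_steps,
                 base_weight=1.0, temperature=4.0):
    """Student objective: cross-entropy plus the scheduled KD term."""
    ce = cross_entropy(student_logits, label)
    kd = kd_loss(student_logits, teacher_logits, temperature)
    return ce + kd_weight(step, total_steps, base_weight) * kd
```

In a joint training loop under these assumptions, `teacher_lr(step, total_steps, base_lr)` would set the teacher optimizer's learning rate at each step, and `student_loss` would be minimized for the student, so the teacher's updates fade out as the distillation term gains weight.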
Pages: 15