A General Self-Supervised Framework for Remote Sensing Image Classification

Cited by: 11
Authors
Gao, Yuan [1 ,2 ,3 ]
Sun, Xiaojuan [1 ,2 ,3 ]
Liu, Chao [4 ]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100094, Peoples R China
[2] Chinese Acad Sci, Key Lab Technol Geospatial Informat Proc & Applic, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 101408, Peoples R China
[4] Ditu Beijing Technol Co Ltd, Beijing 100089, Peoples R China
Keywords
self-supervised learning; remote sensing; scene classification; deep learning; vision transformer; masked image modeling; scene classification; attention; model
DOI
10.3390/rs14194824
Chinese Library Classification (CLC)
X [Environmental science, safety science]
Discipline code
08; 0830
Abstract
This paper provides insights beyond simply combining self-supervised learning (SSL) with remote sensing (RS). Inspired by the representation gains that SSL brings to natural image understanding, we explore and analyze the compatibility of SSL with remote sensing. First, we propose a self-supervised pre-training framework that, for the first time, applies the masked image modeling (MIM) method to RS image research in order to enhance its efficacy. The completion proxy task used by MIM encourages the model to reconstruct the masked patches, and thereby to semantically correlate the unseen parts with the seen parts. Second, to determine how pretext tasks affect downstream performance, we identify an attribution consensus between the pre-trained model and the downstream tasks toward the proxy and classification targets, which differs markedly from that observed in natural image understanding. Moreover, this transferable consensus persists under cross-dataset full or partial fine-tuning, which means that SSL can yield general, model-free representations beyond domain bias and task bias (e.g., classification, segmentation, and detection). Finally, on three publicly accessible RS scene classification datasets, our method outperforms the majority of fully supervised state-of-the-art (SOTA) methods with higher accuracy scores on unlabeled datasets.
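The MIM completion proxy task summarized in the abstract (split the image into patches, mask a random subset, and reconstruct the masked patches from the visible ones, with the loss computed on masked patches only) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual model: `mask_patches`, the patch size of 4, and the 75% mask ratio are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=4, mask_ratio=0.75):
    """Split a square image into non-overlapping patches and zero out a random subset.

    Returns the visible (masked) patches, the original patches, and the
    indices of the masked patches.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    # (ph*pw, patch*patch): one flattened patch per row
    patches = (image.reshape(ph, patch, pw, patch)
                    .swapaxes(1, 2)
                    .reshape(ph * pw, patch * patch))
    n_mask = int(mask_ratio * len(patches))
    masked_idx = rng.permutation(len(patches))[:n_mask]
    visible = patches.copy()
    visible[masked_idx] = 0.0  # masked patches are hidden from the encoder
    return visible, patches, masked_idx

def mim_loss(pred, target, masked_idx):
    """Mean-squared reconstruction error, computed on masked patches only (as in MIM)."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))
```

In an actual MIM pipeline, an encoder-decoder (e.g., a vision transformer) would map `visible` to a prediction of the full patch set, and `mim_loss` would drive the model to infer the hidden content from visible context.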
Pages: 18
Cited references (75 total)
[21] Cubuk, Ekin D.; Zoph, Barret; Shlens, Jonathon; Le, Quoc V. RandAugment: Practical automated data augmentation with a reduced search space. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2020), 2020, pp. 3008-3017.
[22] Devlin, J. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), 2019, Vol. 1, p. 4171.
[23] Doersch, Carl; Zisserman, Andrew. Multi-task Self-Supervised Visual Learning. 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2070-2079.
[24] Doersch, Carl; Gupta, Abhinav; Efros, Alexei A. Unsupervised Visual Representation Learning by Context Prediction. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1422-1430.
[25] Dosovitskiy, Alexey. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[26] Gidaris, S. 2018, arXiv.
[27] Guo, G. D. 2003, Lecture Notes in Computer Science, Vol. 2888, p. 986.
[28] Hadsell, Raia. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, Vol. 2, pp. 1735-1742.
[29] He, K. 2022, arXiv, p. 16000, DOI 10.48550/arXiv.2111.06377.
[30] He, Kaiming; Fan, Haoqi; Wu, Yuxin; Xie, Saining; Girshick, Ross. Momentum Contrast for Unsupervised Visual Representation Learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020, pp. 9726-9735.