Semantic Prototyping With CLIP for Few-Shot Object Detection in Remote Sensing Images

Cited by: 0
Authors
Liu, Tianying [1 ]
Zhou, Shuigeng [2 ,3 ]
Li, Wengen [1 ]
Zhang, Yichao [1 ]
Guan, Jihong [1 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
[2] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200438, Peoples R China
[3] Fudan Univ, Sch Comp Sci, Shanghai 200438, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2025 / Vol. 63
Funding
National Natural Science Foundation of China;
Keywords
Prototypes; Object detection; Remote sensing; Feature extraction; Visualization; Detectors; Training; Proposals; Transformers; Metalearning; Few-shot object detection (FSOD); remote sensing images (RSIs); vision-language model (VLM);
DOI
10.1109/TGRS.2025.3550372
CLC Classification Numbers
P3 [Geophysics]; P59 [Geochemistry];
Subject Classification Codes
0708 ; 070902 ;
Abstract
Few-shot object detection (FSOD) has been proposed to address the problem of insufficient training data and has drawn growing attention from the remote sensing community in recent years. A mainstream line of FSOD methods generates class prototypes from the limited samples to help construct classification decision boundaries. However, in the few-shot scenario, these constructed prototypes may lie far from the true class centroids. Recently, vision-language models (VLMs) have shown a powerful ability to align visual and text features, which yields strong zero-shot performance on various downstream computer vision tasks when given only text. Therefore, in this work, we propose to build class prototypes from text descriptions instead of limited visual instances by leveraging a classical pretrained VLM, CLIP. Concretely, we generate prototypes by feeding class names to the CLIP text encoder and enforce each positive proposal feature to be close to the corresponding prototype. To accelerate the alignment process, we use the CLIP visual encoder as another teacher for visual knowledge distillation. Moreover, we adopt prompt tuning to adapt CLIP to the remote sensing scenario. Extensive experiments on two public FSOD datasets, i.e., DIOR and NWPU VHR-10.v2, demonstrate the effectiveness of our method, which yields results competitive with those of existing approaches.
Pages: 14
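
The core idea described in the abstract, building class prototypes from the CLIP text encoder and pulling positive proposal features toward them, can be illustrated with a minimal sketch. The prompt template, temperature value, and random placeholder proposal features below are assumptions for illustration only, not the authors' exact implementation.

# Minimal sketch: text-based class prototypes from CLIP and an alignment loss
# that pushes positive proposal features toward their class prototypes.
# Uses the OpenAI CLIP package (https://github.com/openai/CLIP).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical remote-sensing class names (DIOR-style categories).
class_names = ["airplane", "baseball field", "bridge", "ship", "storage tank"]
prompts = [f"a remote sensing image of a {c}" for c in class_names]  # assumed template

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    prototypes = model.encode_text(tokens).float()   # [C, 512] text embeddings
    prototypes = F.normalize(prototypes, dim=-1)     # unit-norm class prototypes

# Placeholder for positive RoI/proposal features from the detector head,
# assumed to be already projected to CLIP's embedding dimension (512 here).
proposal_feats = F.normalize(torch.randn(8, 512, device=device), dim=-1)
labels = torch.randint(0, len(class_names), (8,), device=device)

# Alignment loss: cosine similarity to every prototype, scaled by a temperature
# and trained with cross-entropy so each proposal moves toward its prototype.
temperature = 0.07  # assumed value
logits = proposal_feats @ prototypes.t() / temperature   # [8, C]
align_loss = F.cross_entropy(logits, labels)
print(f"alignment loss: {align_loss.item():.4f}")

In practice the proposal features would come from the detection branch rather than random tensors, and the loss would be combined with the usual detection losses; the sketch only shows how text embeddings can serve as classification prototypes.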