Online Zero-Shot Classification with CLIP

Cited by: 0
Authors
Qian, Qi [1 ]
Hu, Juhua [2 ]
Affiliations
[1] Alibaba Grp, Bellevue, WA 98004 USA
[2] Univ Washington, Sch Engn & Technol, Tacoma, WA 98402 USA
Source
COMPUTER VISION - ECCV 2024, PT LXXVII | 2024 / Vol. 15135
Keywords
Online learning; Zero-shot classification; CLIP;
DOI
10.1007/978-3-031-72980-5_27
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language pre-training such as CLIP enables zero-shot transfer that classifies images according to candidate class names. While CLIP demonstrates impressive zero-shot performance on diverse downstream tasks, the distribution of the target data has not been sufficiently leveraged. In this work, we study a novel online zero-shot transfer scenario, where images arrive in a random order and each is visited only once to obtain an immediate prediction, without storing its representation. Compared with vanilla zero-shot classification, the proposed framework preserves the flexibility of online service while exploiting the statistics of previously arrived images as side information to capture the distribution of the target data, which helps improve performance in real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predictions from online label learning and online proxy learning, our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire dataset. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show an average improvement of more than 3%, demonstrating the effectiveness of our proposal.
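A rough illustration of the pipeline the abstract describes: a text-side zero-shot score reweighted by an online estimate of the label distribution, a vision-side score from class proxies updated one image at a time, and a mixture of the two for the final prediction. The Python sketch below is an assumption-laden reading of that description, not OnZeta's actual algorithm: the temperature, step-size schedule, mixing weight, and update rules are placeholders, and random unit vectors stand in for CLIP embeddings so the snippet runs standalone.

# Minimal sketch of the online zero-shot scheme outlined in the abstract.
# NOT the authors' implementation: update rules, step sizes, and the mixing
# weight `alpha` are illustrative assumptions; random unit vectors stand in
# for (unit-normalized) CLIP image/text embeddings.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

C, d = 10, 512                                 # classes, embedding dimension
text_emb = normalize(rng.normal(size=(C, d)))  # stand-in for CLIP text embeddings

# Online state: an estimated label distribution (online label learning) and
# vision-space class proxies initialized from text (online proxy learning).
label_prior = np.full(C, 1.0 / C)
proxies = text_emb.copy()
tau, eta, alpha = 0.05, 0.1, 0.5               # temperature, proxy step, mixing weight

def classify(image_emb, t):
    """Predict one arriving image, then update the online state in place."""
    # Text-side zero-shot scores, reweighted by the running label prior.
    s_text = softmax(image_emb @ text_emb.T / tau) * label_prior
    s_text /= s_text.sum()
    # Vision-side scores against the (re-normalized) class proxies.
    s_proxy = softmax(image_emb @ normalize(proxies).T / tau)
    p = alpha * s_text + (1 - alpha) * s_proxy  # combined prediction
    y = int(p.argmax())
    # Online updates: a running average for the prior, and a decaying step
    # pulling the winning proxy toward the current image embedding.
    label_prior[:] = (t * label_prior + p) / (t + 1)
    proxies[y] += eta / np.sqrt(t + 1) * (image_emb - proxies[y])
    return y

# Each image in the stream is visited exactly once; no features are stored.
for t in range(100):
    classify(normalize(rng.normal(size=d)), t)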
Pages: 462 - 477
Number of pages: 16