SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

Cited by: 10
Authors
Udandarao, Vishaal [1 ]
Gupta, Ankush [2 ]
Albanie, Samuel [1 ]
Affiliations
[1] Univ Cambridge, Cambridge, England
[2] DeepMind, London, England
Keywords
DOI
10.1109/ICCV51070.2023.00257
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval performance on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target task distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks, "SuS" and "TIP-X", that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art (SoTA) zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve SoTA results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.
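
The abstract builds on CLIP's zero-shot, name-only classification, in which an image is matched against text embeddings of the target category names alone. The sketch below illustrates this baseline setup (not SuS-X itself) using the public OpenAI CLIP package; the class names, prompt template, and image path are illustrative assumptions.

# Minimal sketch of training-free, name-only zero-shot classification with CLIP.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git);
# class names, prompt template, and image path are hypothetical examples.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["golden retriever", "tabby cat", "red fox"]  # hypothetical target categories
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # Embed the class-name prompts and the query image, then L2-normalise both.
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each class-name prompt.
    logits = 100.0 * image_features @ text_features.T
    prediction = class_names[logits.argmax(dim=-1).item()]

print(prediction)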
Pages: 2725-2736
Page count: 12