Context-Aware Robust Fine-Tuning

Cited by: 0
Authors
Xiaofeng Mao
Yufeng Chen
Xiaojun Jia
Rong Zhang
Hui Xue
Zhao Li
Affiliations
[1] Alibaba Group
[2] Institute of Information Engineering, Chinese Academy of Sciences
[3] Zhejiang University
Published in
International Journal of Computer Vision | 2024, Vol. 132
Keywords
Pre-trained models; CLIP; Fine-tuning; Robustness
DOI
Not available
Abstract
Contrastive language-image pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by using the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback–Leibler divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT carries the context-aware ability of CLIP over to downstream tasks, achieving both higher in-distribution (ID) and out-of-distribution (OOD) accuracy. Experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet while bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous domain generalization (DG) methods, reaching 78.5% averaged accuracy on the DomainBed benchmark and setting a new state of the art.
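To make the regularizer described in the abstract concrete, below is a minimal Python/PyTorch sketch, not the authors' code: it assumes the context distribution is a softmax over cosine similarities between image features and the zero-shot context prompt embeddings, and the function names, temperature, loss weight, and KLD direction are illustrative assumptions based only on the abstract.

import torch
import torch.nn.functional as F

def context_distribution(image_features, context_text_features, temperature=0.01):
    # Probability distribution over contexts for each image, from cosine
    # similarity against the zero-shot context prompt embeddings.
    image_features = F.normalize(image_features, dim=-1)
    context_text_features = F.normalize(context_text_features, dim=-1)
    logits = image_features @ context_text_features.t() / temperature
    return F.softmax(logits, dim=-1)

def car_ft_loss(class_logits, labels, ft_image_features,
                frozen_image_features, context_text_features, kld_weight=1.0):
    # Fine-tuning objective: cross-entropy on the downstream task plus a
    # KLD regularizer keeping the fine-tuned model's context distribution
    # close to the one induced by the frozen, original CLIP encoder.
    ce = F.cross_entropy(class_logits, labels)
    p_ft = context_distribution(ft_image_features, context_text_features)
    with torch.no_grad():
        p_orig = context_distribution(frozen_image_features, context_text_features)
    # F.kl_div takes log-probabilities as input and probabilities as target,
    # so this computes KL(p_orig || p_ft); the direction here is an assumption.
    kld = F.kl_div(p_ft.log(), p_orig, reduction="batchmean")
    return ce + kld_weight * kld

Both encoders score against the same fixed prompt weights; only the fine-tuned encoder receives gradients, so the regularizer pulls its context predictions back toward the pre-trained model's.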
Pages: 1685–1700
Page count: 15