Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

被引：23

作者：

Ma, Chengcheng ^{[1
,2
]}

Liu, Yang ^{[3
]}

Deng, Jiankang ^{[4
]}

Xie, Lingxi ^{[4
]}

Dong, Weiming ^{[1
]}

Xu, Changsheng ^{[1
]}

机构：

[1] Chinese Acad Sci CASIA, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China

[3] Alibaba DAMO Acad, Hangzhou 310024, Peoples R China

[4] Huawei Inc, Shenzhen 518129, Peoples R China

来源：

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2023年 / 33卷 / 09期

基金：

美国国家科学基金会; 北京市自然科学基金;

关键词：

Vision-language model; prompt tuning; over-fitting; subspace learning; gradient projection;

D O I：

10.1109/TCSVT.2023.3245584

中图分类号：

TM [电工技术]; TN [电子技术、通信技术];

学科分类号：

0808 ; 0809 ;

摘要：

Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been recently proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from the overfitting issue in two aspects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such overfitting problems. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages, respectively, leading to the non-overfitting and overfitting phenomena. Given those observations, we propose Subspace Prompt Tuning (Sub PT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process and successfully eliminate the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set, needless of image training data. Extensive experiments on 11 classification datasets demonstrate that Sub PT+NFL consistently boost the performance of CoOp and outperform the state-of-the-art CoCoOp approach. Experiments on more challenging vision downstream tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Codes can be found at https://tinyurl.com/mpe64f89.

引用

页码：4616 / 4629

页数：14

共 50 条

[1] Adversarial Prompt Tuning for Vision-Language Models
Zhang, Jiaming
Ma, Xingjun
Wang, Xin
Qiu, Lingyu
Wang, Jiaqi
Jiang, Yu-Gang
Sang, Jitao
COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 56 - 72
[2] Distribution-Aware Prompt Tuning for Vision-Language Models
Cho, Eulrang
Kim, Jooyeon
Kim, Hyunwoo J.
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21947 - 21956
[3] Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?
Wu, Cheng-En
Tian, Yu
Yu, Haichao
Wang, Heng
Morgado, Pedro
Hu, Yu Hen
Yang, Linjie
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15442 - 15451
[4] Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
Kan, Baoshuo
Wang, Teng
Lu, Wenpeng
Zhen, Xiantong
Guan, Weili
Zheng, Feng
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15624 - 15634
[5] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
Zhu, Beier
Niu, Yulei
Lee, Saeil
Hur, Minhoe
Zhang, Hanwang
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
[6] Learning to Prompt for Vision-Language Models
Zhou, Kaiyang
Yang, Jingkang
Loy, Chen Change
Liu, Ziwei
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (09) : 2337 - 2348
[7] Learning to Prompt for Vision-Language Models
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
International Journal of Computer Vision, 2022, 130 : 2337 - 2348
[8] Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Zhang, Jinrui
Wang, Teng
Zhang, Haigang
Lu, Ping
Zheng, Feng
COMPUTER VISION - ECCV 2024, PT XXXVII, 2025, 15095 : 196 - 213
[9] CTPT: Continual Test-time Prompt Tuning for vision-language models
Wang, Fan
Han, Zhongyi
Liu, Xingbo
Yin, Yilong
Gao, Xin
PATTERN RECOGNITION, 2025, 161
[10] CPT: Colorful Prompt Tuning for pre-trained vision-language models
Yao, Yuan
Zhang, Ao
Zhang, Zhengyan
Liu, Zhiyuan
Chua, Tat-Seng
Sun, Maosong
AI OPEN, 2024, 5 : 30 - 38

← 1 2 3 4 5 →