Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Cited by: 17
Authors
Xing, Yinghui [1 ,2 ]
Wu, Qirui [1 ]
Cheng, De [3 ]
Zhang, Shizhou [1 ]
Liang, Guoqiang [1 ]
Wang, Peng [1 ]
Zhang, Yanning [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ Shenzhen, Res Dev Inst, Shenzhen 518057, Peoples R China
[3] Xidian Univ, Sch Telecommun Engn, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Tuning; Task analysis; Adaptation models; Computational modeling; Feature extraction; Training; Few-shot learning; transfer learning; image classification; prompt tuning; vision-language model;
DOI
10.1109/TMM.2023.3291588
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
With the emergence of large pretrained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pretrained model for information beneficial to downstream tasks. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompts on the language side. However, tuning the text prompt alone only adjusts the synthesized "classifier", while the visual features computed by the image encoder remain unchanged, leading to suboptimal solutions. In this article, we propose a novel dual-modality prompt tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a class-aware visual prompt tuning (CAVPT) scheme is further proposed in our DPT. In this scheme, the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method.
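The abstract above only sketches the CAVPT mechanism in words; the following is a minimal illustrative sketch, not the authors' implementation, of how a class-aware visual prompt could be produced by cross attention between text prompt features (queries) and image patch token embeddings (keys/values). The module name, the single-head design, and the 512-dimensional CLIP-like feature size are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ClassAwareVisualPrompt(nn.Module):
    """Hypothetical cross-attention block: text prompt features attend to image patches."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # text prompt features -> queries
        self.k_proj = nn.Linear(dim, dim)  # patch token embeddings -> keys
        self.v_proj = nn.Linear(dim, dim)  # patch token embeddings -> values
        self.scale = dim ** -0.5

    def forward(self, text_prompt_feats: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_prompt_feats: (batch, num_classes, dim), one feature per class prompt
        # patch_tokens:      (batch, num_patches, dim), image patch token embeddings
        q = self.q_proj(text_prompt_feats)
        k = self.k_proj(patch_tokens)
        v = self.v_proj(patch_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # The output mixes task (class) information with instance-specific visual
        # content; it would then be fed to the image encoder as extra prompt tokens.
        return attn @ v  # (batch, num_classes, dim)


if __name__ == "__main__":
    prompts = torch.randn(2, 10, 512)   # e.g., 10 class prompts
    patches = torch.randn(2, 196, 512)  # e.g., 14 x 14 ViT patch grid
    cavp = ClassAwareVisualPrompt(512)
    print(cavp(prompts, patches).shape)  # torch.Size([2, 10, 512])
```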
Pages: 2056 - 2068
Number of pages: 13
Related Papers
50 records in total
  • [1] Constraint embedding for prompt tuning in vision-language pre-trained model
    Cheng, Keyang
    Wei, Liutao
    Tang, Jingfeng
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [2] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [3] MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
    Miao, Yongzhu
    Li, Shasha
    Tang, Jintao
    Wang, Ting
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 25 - 30
  • [4] Prompt Tuning for Discriminative Pre-trained Language Models
    Yao, Yuan
    Dong, Bowen
    Zhang, Ao
    Zhang, Zhengyan
    Xie, Ruobing
    Liu, Zhiyuan
    Lin, Leyu
    Sun, Maosong
    Wang, Jianyong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 3468 - 3473
  • [5] CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model
    Zhao, Xiaoqing
    Xu, Miaomiao
    Silamu, Wushour
    Li, Yanbing
    SENSORS, 2024, 24 (22)
  • [6] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
    Zheng, Kecheng
    Wu, Wei
    Feng, Ruili
    Zhu, Kai
    Liu, Jiawei
    Zhao, Deli
    Zha, Zheng-Jun
    Chen, Wei
    Shen, Yujun
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11629 - 11639
  • [7] Universal Adversarial Perturbations for Vision-Language Pre-trained Models
    Zhang, Peng-Fei
    Huang, Zi
    Bai, Guangdong
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 862 - 871
  • [8] DictPrompt: Comprehensive dictionary-integrated prompt tuning for pre-trained language model
    Cao, Rui
    Wang, Yihao
    Gao, Ling
    Yang, Meng
    KNOWLEDGE-BASED SYSTEMS, 2023, 273
  • [9] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287