SDPT: Synchronous Dual Prompt Tuning for Fusion-Based Visual-Language Pre-trained Models

Cited by: 0
|
Authors
Zhou, Yang [1 ]
Wu, Yongjian [1 ]
Saiyin, Jiya [1 ]
Wei, Bingzheng [2 ]
Lai, Maode [3 ]
Chang, Eric [4 ]
Xu, Yan [1 ]
Affiliations
[1] Beihang Univ, Sch Biol Sci & Med Engn, Beijing, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Taiwan Artificial Intelligence Fdn, Taipei, Taiwan
Source
COMPUTER VISION - ECCV 2024, PT XLIX | 2025, Vol. 15107
Keywords
Prompt tuning; Parameter-efficient fine-tuning; Visual-language pre-trained models;
DOI
10.1007/978-3-031-72967-6_19
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Prompt tuning methods have achieved remarkable success in parameter-efficient fine-tuning of large pre-trained models. However, their application to dual-modal fusion-based visual-language pre-trained models (VLPMs), such as GLIP, has encountered issues: existing prompt tuning methods do not effectively address the problem of mapping and aligning tokens across modalities, leading to poor transfer generalization. To address this, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of the text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections, which require no training, to embed the information of the unified prototype tokens into the input spaces of the two modalities. These inverse linear projections allow the unified prototype tokens to represent both modalities synchronously, enabling SDPT to share the unified text-image semantics of a downstream task across the prompts of different modalities. Experimental results demonstrate that SDPT helps fusion-based VLPMs achieve superior outcomes while training only 0.04% of the model parameters across various scenarios, outperforming other single- or dual-modal methods. The code will be released at https://github.com/wuyongjianCODE/SDPT.
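The mechanism the abstract describes lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration, assuming the frozen text-to-aligning-space and image-to-aligning-space projection matrices are available, and assuming the training-free inverse linear projections can be realized as Moore-Penrose pseudo-inverses of those matrices; the names (SDPTPrompts, w_text, w_image, unified_tokens) and all dimensions are illustrative and not taken from the authors' released code.

```python
import torch
import torch.nn as nn


class SDPTPrompts(nn.Module):
    """Sketch of unified prototype tokens with training-free inverse
    projections. Assumption: pseudo-inverses of the frozen modal
    projections stand in for the paper's inverse linear projections."""

    def __init__(self, w_text: torch.Tensor, w_image: torch.Tensor,
                 num_tokens: int = 8):
        super().__init__()
        d_align = w_text.shape[0]
        # Single set of learnable unified prototype tokens living in the
        # established modal aligning space; the only trained parameters.
        self.unified_tokens = nn.Parameter(0.02 * torch.randn(num_tokens, d_align))
        # Training-free inverse linear projections back into each input
        # space, stored as frozen buffers (no gradients).
        self.register_buffer("inv_text", torch.linalg.pinv(w_text))    # (d_text, d_align)
        self.register_buffer("inv_image", torch.linalg.pinv(w_image))  # (d_image, d_align)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # Map the shared tokens into the text and image input spaces, then
        # prepend them to each modality's token sequence so both prompts
        # are driven synchronously by the same unified semantics.
        text_prompts = self.unified_tokens @ self.inv_text.T    # (num_tokens, d_text)
        image_prompts = self.unified_tokens @ self.inv_image.T  # (num_tokens, d_image)
        b = text_tokens.shape[0]
        text_in = torch.cat([text_prompts.expand(b, -1, -1), text_tokens], dim=1)
        image_in = torch.cat([image_prompts.expand(b, -1, -1), image_tokens], dim=1)
        return text_in, image_in


# Hypothetical usage with GLIP-like dimensions (values are illustrative):
w_t = torch.randn(256, 768)      # frozen text -> aligning-space projection
w_i = torch.randn(256, 1024)     # frozen image -> aligning-space projection
prompts = SDPTPrompts(w_t, w_i, num_tokens=8)
txt = torch.randn(2, 20, 768)    # (batch, sequence length, d_text)
img = torch.randn(2, 100, 1024)  # (batch, patches, d_image)
txt_in, img_in = prompts(txt, img)  # shapes (2, 28, 768) and (2, 108, 1024)
```

Under these assumptions, only unified_tokens receives gradients while the backbone and the inverse projections stay frozen, which is consistent with the paper's claim of training only 0.04% of the model parameters.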
Pages: 340-356
Page count: 17
Related Papers
16 records in total
  • [1] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
    Xing, Yinghui
    Wu, Qirui
    Cheng, De
    Zhang, Shizhou
    Liang, Guoqiang
    Wang, Peng
    Zhang, Yanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2056 - 2068
  • [2] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [3] Zero-Shot Nuclei Detection via Visual-Language Pre-trained Models
    Wu, Yongjian
    Zhou, Yang
    Saiyin, Jiya
    Wei, Bingzheng
    Lai, Maode
    Shou, Jianzhong
    Fan, Yubo
    Xu, Yan
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VI, 2023, 14225 : 693 - 703
  • [4] DVPT: Dynamic Visual Prompt Tuning of large pre-trained models for medical image analysis
    He, Along
    Wu, Yanlin
    Wang, Zhihong
    Li, Tao
    Fu, Huazhu
    NEURAL NETWORKS, 2025, 185
  • [5] Constraint embedding for prompt tuning in vision-language pre-trained model
    Cheng, Keyang
    Wei, Liutao
    Tang, Jingfeng
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [6] Context-focused Prompt Tuning Pre-trained Code Models to Improve Code Summarization
    Pan, Xinglu
    Liu, Chenxiao
    Zou, Yanzhen
    Zhao, Xianlin
    Xie, Bing
    2024 IEEE 48TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC 2024, 2024, : 1344 - 1349
  • [7] AttriPrompter: Auto-Prompting With Attribute Semantics for Zero-Shot Nuclei Detection via Visual-Language Pre-Trained Models
    Wu, Yongjian
    Zhou, Yang
    Saiyin, Jiya
    Wei, Bingzheng
    Lai, Maode
    Shou, Jianzhong
    Xu, Yan
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2025, 44 (02) : 982 - 993
  • [8] MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
    Miao, Yongzhu
    Li, Shasha
    Tang, Jintao
    Wang, Ting
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 25 - 30
  • [9] Few-shot medical relation extraction via prompt tuning enhanced pre-trained language model
    He, Guoxiu
    Huang, Chen
    NEUROCOMPUTING, 2025, 633