SDPT: Synchronous Dual Prompt Tuning for Fusion-Based Visual-Language Pre-trained Models

Cited by: 0
|
Authors
Zhou, Yang [1 ]
Wu, Yongjian [1 ]
Saiyin, Jiya [1 ]
Wei, Bingzheng [2 ]
Lai, Maode [3 ]
Chang, Eric [4 ]
Xu, Yan [1 ]
Affiliations
[1] Beihang Univ, Sch Biol Sci & Med Engn, Beijing, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Taiwan Artificial Intelligence Fdn, Taipei, Taiwan
Source
COMPUTER VISION - ECCV 2024, PT XLIX | 2025, Vol. 15107
Keywords
Prompt tuning; Parameter-efficient fine-tuning; Visual-language pre-trained models;
DOI
10.1007/978-3-031-72967-6_19
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Prompt tuning methods have achieved remarkable success in parameter-efficient fine-tuning of large pre-trained models. However, their application to dual-modal fusion-based visual-language pre-trained models (VLPMs), such as GLIP, has encountered issues: existing prompt tuning methods do not effectively address the problem of mapping and aligning tokens across modalities, leading to poor transfer generalization. To address this, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of the text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections, which require no training, to embed the information of the unified prototype tokens into the input spaces of the two modalities. These inverse linear projections allow the unified prototype tokens to represent both modalities synchronously, enabling SDPT to share the unified text-image semantics of a downstream task across the prompts of different modalities. Experimental results demonstrate that SDPT helps fusion-based VLPMs achieve superior outcomes while training only 0.04% of the model parameters across various scenarios, outperforming other single- or dual-modal methods. The code will be released at https://github.com/wuyongjianCODE/SDPT.
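The mechanism the abstract describes lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration, assuming the frozen text-to-aligning-space and image-to-aligning-space projection matrices are available, and assuming the training-free inverse linear projections can be realized as Moore-Penrose pseudo-inverses of those matrices; the names (SDPTPrompts, w_text, w_image, unified_tokens) and all dimensions are illustrative and not taken from the authors' released code.

```python
import torch
import torch.nn as nn


class SDPTPrompts(nn.Module):
    """Sketch of unified prototype tokens with training-free inverse
    projections. Assumption: pseudo-inverses of the frozen modal
    projections stand in for the paper's inverse linear projections."""

    def __init__(self, w_text: torch.Tensor, w_image: torch.Tensor,
                 num_tokens: int = 8):
        super().__init__()
        d_align = w_text.shape[0]
        # Single set of learnable unified prototype tokens living in the
        # established modal aligning space; the only trained parameters.
        self.unified_tokens = nn.Parameter(0.02 * torch.randn(num_tokens, d_align))
        # Training-free inverse linear projections back into each input
        # space, stored as frozen buffers (no gradients).
        self.register_buffer("inv_text", torch.linalg.pinv(w_text))    # (d_text, d_align)
        self.register_buffer("inv_image", torch.linalg.pinv(w_image))  # (d_image, d_align)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # Map the shared tokens into the text and image input spaces, then
        # prepend them to each modality's token sequence so both prompts
        # are driven synchronously by the same unified semantics.
        text_prompts = self.unified_tokens @ self.inv_text.T    # (num_tokens, d_text)
        image_prompts = self.unified_tokens @ self.inv_image.T  # (num_tokens, d_image)
        b = text_tokens.shape[0]
        text_in = torch.cat([text_prompts.expand(b, -1, -1), text_tokens], dim=1)
        image_in = torch.cat([image_prompts.expand(b, -1, -1), image_tokens], dim=1)
        return text_in, image_in


# Hypothetical usage with GLIP-like dimensions (values are illustrative):
w_t = torch.randn(256, 768)      # frozen text -> aligning-space projection
w_i = torch.randn(256, 1024)     # frozen image -> aligning-space projection
prompts = SDPTPrompts(w_t, w_i, num_tokens=8)
txt = torch.randn(2, 20, 768)    # (batch, sequence length, d_text)
img = torch.randn(2, 100, 1024)  # (batch, patches, d_image)
txt_in, img_in = prompts(txt, img)  # shapes (2, 28, 768) and (2, 108, 1024)
```

Under these assumptions, only unified_tokens receives gradients while the backbone and the inverse projections stay frozen, which is consistent with the paper's claim of training only 0.04% of the model parameters.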
Pages: 340-356
Page count: 17
Related Papers
16 records in total
  • [1] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
    Xing, Yinghui
    Wu, Qirui
    Cheng, De
    Zhang, Shizhou
    Liang, Guoqiang
    Wang, Peng
    Zhang, Yanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2056 - 2068
  • [2] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [3] Zero-Shot Nuclei Detection via Visual-Language Pre-trained Models
    Wu, Yongjian
    Zhou, Yang
    Saiyin, Jiya
    Wei, Bingzheng
    Lai, Maode
    Shou, Jianzhong
    Fan, Yubo
    Xu, Yan
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VI, 2023, 14225 : 693 - 703
  • [4] DVPT: Dynamic Visual Prompt Tuning of large pre-trained models for medical image analysis
    He, Along
    Wu, Yanlin
    Wang, Zhihong
    Li, Tao
    Fu, Huazhu
    NEURAL NETWORKS, 2025, 185
  • [5] Constraint embedding for prompt tuning in vision-language pre-trained model
    Cheng, Keyang
    Wei, Liutao
    Tang, Jingfeng
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [6] Context-focused Prompt Tuning Pre-trained Code Models to Improve Code Summarization
    Pan, Xinglu
    Liu, Chenxiao
    Zou, Yanzhen
    Zhao, Xianlin
    Xie, Bing
    2024 IEEE 48TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC 2024, 2024, : 1344 - 1349
  • [7] AttriPrompter: Auto-Prompting With Attribute Semantics for Zero-Shot Nuclei Detection via Visual-Language Pre-Trained Models
    Wu, Yongjian
    Zhou, Yang
    Saiyin, Jiya
    Wei, Bingzheng
    Lai, Maode
    Shou, Jianzhong
    Xu, Yan
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2025, 44 (02) : 982 - 993
  • [8] MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
    Miao, Yongzhu
    Li, Shasha
    Tang, Jintao
    Wang, Ting
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 25 - 30
  • [9] Few-shot medical relation extraction via prompt tuning enhanced pre-trained language model
    He, Guoxiu
    Huang, Chen
    NEUROCOMPUTING, 2025, 633