Supporting vision-language model few-shot inference with confounder-pruned knowledge prompt

Cited by: 0
Authors
Li, Jiangmeng [1 ]
Mo, Wenyi [2 ]
Song, Fei [1 ,3 ]
Sun, Chuxiong [1 ]
Qiang, Wenwen [1 ]
Su, Bing [2 ]
Zheng, Changwen [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Natl Key Lab Space Integrated Informat Syst, Inst Software, Beijing, Peoples R China
[2] Renmin Univ China, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-modal model; Large-scale pre-training; Prompt learning; Maximum entropy; Knowledge graph;
DOI
10.1016/j.neunet.2025.107173
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. Recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language descriptions of task-relevant categories, to reduce the gap between tasks during the pre-training and inference phases. However, how prompts improve inference performance, and which prompts do so, remains unclear. In this paper, we explicitly clarify the importance of incorporating semantic information into prompts; existing prompting methods generate prompts without sufficiently exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To cope with this issue, we propose a knowledge-aware prompt learning method, namely Confounder-pruned Knowledge Prompt (CPKP), which retrieves an ontology knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. Adhering to the individual causal effect principle, the graph-tier confounders are gradually identified and phased out. The feature-tier confounders are eliminated by following the maximum entropy principle in information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP in few-shot inference, e.g., with only two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average.
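The following Python sketch illustrates, at a toy scale, the two ideas named in the abstract: building a prompt enriched with concepts retrieved from a knowledge graph, and pruning feature dimensions by an entropy-style criterion. The graph contents, the helper names (retrieve_neighbors, build_knowledge_prompt, prune_low_entropy_dims), and the variance-as-entropy proxy are illustrative assumptions, not the authors' CPKP implementation.

# Illustrative sketch only; toy data and hypothetical helpers, not the CPKP code.
import numpy as np

# Toy ontology: each class label maps to related concepts (graph neighbors).
TOY_GRAPH = {
    "airplane": ["wing", "jet engine", "fuselage"],
    "dog": ["fur", "tail", "mammal"],
}

def retrieve_neighbors(label, k=3):
    """Treat the textual label as a query and return up to k related concepts."""
    return TOY_GRAPH.get(label, [])[:k]

def build_knowledge_prompt(label):
    """Compose a prompt enriched with the retrieved semantic information."""
    concepts = retrieve_neighbors(label)
    context = ", ".join(concepts) if concepts else "no extra context"
    return f"a photo of a {label}, which typically has {context}."

def prune_low_entropy_dims(features, keep_ratio=0.8):
    """Keep the feature dimensions with the highest entropy proxy.

    Dimensions whose values barely vary across samples carry little
    information; dropping them loosely mirrors a maximum-entropy-style
    feature-tier pruning step. Per-dimension variance stands in for
    entropy here, since Gaussian differential entropy grows with log-variance.
    """
    variances = features.var(axis=0)
    n_keep = max(1, int(keep_ratio * features.shape[1]))
    keep_idx = np.argsort(variances)[-n_keep:]
    return features[:, np.sort(keep_idx)]

if __name__ == "__main__":
    print(build_knowledge_prompt("airplane"))
    feats = np.random.default_rng(0).normal(size=(16, 10))
    feats[:, 0] = 0.0  # a constant (zero-variance) dimension that gets pruned
    print(prune_low_entropy_dims(feats).shape)  # (16, 8)

In the actual method, the retrieval step queries an ontology knowledge graph and the pruning steps follow the individual causal effect and maximum entropy principles described above; the sketch only mirrors the overall flow.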
Pages: 13