Supporting vision-language model few-shot inference with confounder-pruned knowledge prompt

Cited by: 0
Authors
Li, Jiangmeng [1 ]
Mo, Wenyi [2 ]
Song, Fei [1 ,3 ]
Sun, Chuxiong [1 ]
Qiang, Wenwen [1 ]
Su, Bing [2 ]
Zheng, Changwen [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Natl Key Lab Space Integrated Informat Syst, Inst Software, Beijing, Peoples R China
[2] Renmin Univ China, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-modal model; Large-scale pre-training; Prompt learning; Maximum entropy; Knowledge graph;
DOI
10.1016/j.neunet.2025.107173
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. Recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language descriptions of task-relevant categories, to reduce the gap between tasks during the pre-training and inference phases. However, how prompts improve inference performance, and which prompts do so, remains unclear. In this paper, we explicitly clarify the importance of incorporating semantic information into prompts; existing prompting methods generate prompts without sufficiently exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To cope with this issue, we propose a knowledge-aware prompt learning method, namely Confounder-pruned Knowledge Prompt (CPKP), which retrieves an ontology knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. Adhering to the individual causal effect principle, the graph-tier confounders are gradually identified and phased out. The feature-tier confounders are eliminated by following the maximum entropy principle in information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP in few-shot inference, e.g., with only two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average.
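The following Python sketch illustrates, at a toy scale, the two ideas named in the abstract: building a prompt enriched with concepts retrieved from a knowledge graph, and pruning feature dimensions by an entropy-style criterion. The graph contents, the helper names (retrieve_neighbors, build_knowledge_prompt, prune_low_entropy_dims), and the variance-as-entropy proxy are illustrative assumptions, not the authors' CPKP implementation.

# Illustrative sketch only; toy data and hypothetical helpers, not the CPKP code.
import numpy as np

# Toy ontology: each class label maps to related concepts (graph neighbors).
TOY_GRAPH = {
    "airplane": ["wing", "jet engine", "fuselage"],
    "dog": ["fur", "tail", "mammal"],
}

def retrieve_neighbors(label, k=3):
    """Treat the textual label as a query and return up to k related concepts."""
    return TOY_GRAPH.get(label, [])[:k]

def build_knowledge_prompt(label):
    """Compose a prompt enriched with the retrieved semantic information."""
    concepts = retrieve_neighbors(label)
    context = ", ".join(concepts) if concepts else "no extra context"
    return f"a photo of a {label}, which typically has {context}."

def prune_low_entropy_dims(features, keep_ratio=0.8):
    """Keep the feature dimensions with the highest entropy proxy.

    Dimensions whose values barely vary across samples carry little
    information; dropping them loosely mirrors a maximum-entropy-style
    feature-tier pruning step. Per-dimension variance stands in for
    entropy here, since Gaussian differential entropy grows with log-variance.
    """
    variances = features.var(axis=0)
    n_keep = max(1, int(keep_ratio * features.shape[1]))
    keep_idx = np.argsort(variances)[-n_keep:]
    return features[:, np.sort(keep_idx)]

if __name__ == "__main__":
    print(build_knowledge_prompt("airplane"))
    feats = np.random.default_rng(0).normal(size=(16, 10))
    feats[:, 0] = 0.0  # a constant (zero-variance) dimension that gets pruned
    print(prune_low_entropy_dims(feats).shape)  # (16, 8)

In the actual method, the retrieval step queries an ontology knowledge graph and the pruning steps follow the individual causal effect and maximum entropy principles described above; the sketch only mirrors the overall flow.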
Pages: 13