Supporting vision-language model few-shot inference with confounder-pruned knowledge prompt

Cited by: 0
Authors
Li, Jiangmeng [1]
Mo, Wenyi [2]
Song, Fei [1,3]
Sun, Chuxiong [1]
Qiang, Wenwen [1]
Su, Bing [2]
Zheng, Changwen [1,3]
Affiliations
[1] Chinese Acad Sci, Natl Key Lab Space Integrated Informat Syst, Inst Software, Beijing, Peoples R China
[2] Renmin Univ China, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Multi-modal model; Large-scale pre-training; Prompt learning; Maximum entropy; Knowledge graph;
DOI
10.1016/j.neunet.2025.107173
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. Recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language descriptions of task-relevant categories, to reduce the gap between tasks during the pre-training and inference phases. However, how and what prompts can improve inference performance remains unclear. In this paper, we explicitly clarify the importance of incorporating semantic information into prompts, whereas existing prompting methods generate prompts without sufficiently exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To cope with this issue, we propose a knowledge-aware prompt learning method, namely Confounder-pruned Knowledge Prompt (CPKP), which retrieves an ontology knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. Adhering to the individual causal effect principle, the graph-tier confounders are gradually identified and phased out. The feature-tier confounders are eliminated by following the maximum entropy principle in information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP in few-shot inference, e.g., with only two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average.
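For readers unfamiliar with knowledge-enriched prompting, the minimal Python sketch below illustrates the general idea of treating a textual label as a query against an ontology and composing a prompt from the retrieved entities. It is an illustrative toy only: the ontology contents, the prompt template, and the function name are assumptions for exposition, not the retrieval procedure or template used by CPKP.

    # Illustrative sketch only: TOY_ONTOLOGY stands in for a real ontology
    # knowledge graph, and the prompt template is an assumption, not the
    # one used by CPKP.
    TOY_ONTOLOGY = {
        "airliner": ["fixed-wing aircraft", "jet engine", "passenger transport"],
        "sparrow": ["small bird", "brown plumage", "perching songbird"],
    }

    def knowledge_prompt(label: str) -> str:
        """Treat the textual label as a query and attach the retrieved entities."""
        neighbours = TOY_ONTOLOGY.get(label, [])
        if not neighbours:
            return f"a photo of a {label}."
        return f"a photo of a {label}, which relates to {', '.join(neighbours)}."

    if __name__ == "__main__":
        for label in TOY_ONTOLOGY:
            print(knowledge_prompt(label))

In a CLIP-style pipeline, prompts assembled this way would be passed to the text encoder to synthesize classification weights; how CPKP subsequently prunes graph-tier and feature-tier confounders is described in the paper itself and is not reproduced here.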
Pages: 13