Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited by: 2
Authors
Jiang, Jingjing [1]
Liu, Ziyi [1]
Zheng, Nanning [1]
Affiliations
[1] Xi'an Jiaotong University, Institute of Artificial Intelligence and Robotics, Xi'an 710049, Shaanxi, People's Republic of China
Funding
U.S. National Science Foundation;
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model; LANGUAGE;
DOI
10.1007/s11263-023-01858-y
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Benefiting from large-scale pretrained vision language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization issues, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to defend against visual and linguistic input variations, as well as shortcut learning involved in inputs. Generally, the representations obtained by pretrained VLMs inevitably contain irrelevant and redundant information for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
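The tradeoff described in the abstract (minimize the MI between inputs and representations while maximizing the MI between outputs and representations) can be illustrated with the generic information bottleneck Lagrangian, beta * I(X;Z) - I(Y;Z). The sketch below is not the paper's CIB objective or its upper bound; it is a toy discrete example, and the names `mutual_information`, `ib_lagrangian`, `p_xy`, `p_z_given_x`, and `beta` are assumptions introduced here for illustration only.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in nats for a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (p_a[a] * p_b[b]))
    return mi


def ib_lagrangian(p_xy, p_z_given_x, beta):
    """Generic IB objective to minimize: beta * I(X;Z) - I(Y;Z).

    Assumes the Markov chain Y - X - Z, i.e. Z depends on Y only through X.
    """
    nx, ny, nz = len(p_xy), len(p_xy[0]), len(p_z_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # Joint of input and representation: p(x, z) = p(x) * p(z | x)
    p_xz = [[p_x[x] * p_z_given_x[x][z] for z in range(nz)] for x in range(nx)]
    # Joint of output and representation: p(y, z) = sum_x p(x, y) * p(z | x)
    p_yz = [
        [sum(p_xy[x][y] * p_z_given_x[x][z] for x in range(nx)) for z in range(nz)]
        for y in range(ny)
    ]
    return beta * mutual_information(p_xz) - mutual_information(p_yz)


# Toy data (hypothetical): X is uniform on {0, 1, 2, 3} and Y = X // 2.
p_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]

# Encoder A copies X into Z and so keeps information irrelevant to Y.
identity_encoder = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]
# Encoder B keeps only the part of X that determines Y.
compressed_encoder = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

loss_identity = ib_lagrangian(p_xy, identity_encoder, beta=0.5)
loss_compressed = ib_lagrangian(p_xy, compressed_encoder, beta=0.5)
```

In this toy setting both encoders retain the same information about Y, but the compressed encoder carries less information about X, so it attains the lower objective value. This is the sense in which a representation is pushed toward a minimal sufficient statistic.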
Pages: 185-207
Page count: 23