Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited by: 2
Authors
Jiang, Jingjing [1 ]
Liu, Ziyi [1 ]
Zheng, Nanning [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710049, Shaanxi, Peoples R China
Funding
US National Science Foundation;
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model; LANGUAGE;
DOI
10.1007/s11263-023-01858-y
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Benefiting from large-scale pretrained vision language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization issues, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to defend against visual and linguistic input variations, as well as shortcut learning involved in inputs. Generally, the representations obtained by pretrained VLMs inevitably contain irrelevant and redundant information for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
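The compression-redundancy tradeoff described in the abstract follows the generic information bottleneck objective, which can be written as below. This is the standard IB formulation, not the paper's exact CIB objective; the paper additionally incorporates internal correlations between modalities, which are not reproduced here.

```latex
% Generic information bottleneck objective: learn a stochastic encoder
% p(z | x) mapping inputs X to representations Z that predict outputs Y.
% I(.;.) denotes mutual information; beta trades off compressing away
% input information against retaining output-relevant information.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)
```

In the VQA setting described here, X would comprise both visual and linguistic inputs, and the paper's contribution is a tight upper bound on the multimodal term I(X; Z) that exploits correlations between the modalities.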
Pages: 185-207
Page count: 23
Related Papers
50 records in total
  • [1] Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
    Jingjing Jiang
    Ziyi Liu
    Nanning Zheng
    International Journal of Computer Vision, 2024, 132 : 185 - 207
  • [2] VISUAL QUESTION ANSWERING IN REMOTE SENSING WITH CROSS-ATTENTION AND MULTIMODAL INFORMATION BOTTLENECK
    Songara, Jayesh
    Pande, Shivam
    Choudhury, Shabnam
    Banerjee, Biplab
    Velmurugan, Rajbabu
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 6278 - 6281
  • [3] Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms
    Srivastava, Avikalp
    Liu, Hsin-Wen
    Fujita, Sumio
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019, : 1421 - 1430
  • [4] OpenViVQA: Task, dataset, and multimodal fusion models for visual question answering in Vietnamese
    Nguyen, Nghia Hieu
    Vo, Duong T. D.
    Nguyen, Kiet Van
    Nguyen, Ngan Luu-Thuy
    INFORMATION FUSION, 2023, 100
  • [5] Improving Visual Question Answering by Multimodal Gate Fusion Network
    Xiang, Shenxiang
    Chen, Qiaohong
    Fang, Xian
    Guo, Menghao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [6] Finetuning Language Models for Multimodal Question Answering
    Zhang, Xin
    Xie, Wen
    Dai, Ziqi
    Rao, Jun
    Wen, Haokun
    Luo, Xuan
    Zhang, Meishan
    Zhang, Min
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9420 - 9424
  • [7] Robust Visual Question Answering: Datasets, Methods, and Future Challenges
    Ma, Jie
    Wang, Pinghui
    Kong, Dechen
    Wang, Zewei
    Liu, Jun
    Pei, Hongbin
    Zhao, Junzhou
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5575 - 5594
  • [8] Exploring and exploiting model uncertainty for robust visual question answering
    Zhang, Xuesong
    He, Jun
    Zhao, Jia
    Hu, Zhenzhen
    Yang, Xun
    Li, Jia
    Hong, Richang
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [9] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [10] Improving Visual Question Answering by Leveraging Depth and Adapting Explainability
    Panesar, Amrita
    Dogan, Fethiye Irmak
    Leite, Iolanda
    2022 31ST IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION (IEEE RO-MAN 2022), 2022, : 252 - 259