Robust data augmentation and contrast learning for debiased visual question answering

Cited: 0
Authors
Ning, Ke [1 ,2 ]
Li, Zhixin [1 ,2 ]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Language priors; Data augmentation; Knowledge distillation; Contrastive learning;
DOI
10.1016/j.neucom.2025.129527
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
The language prior problem in VQA leads a model to predict answers directly from spurious correlations between questions and answers, so its performance drops sharply on out-of-distribution data. Current debiasing methods often achieve good out-of-distribution generalization at the expense of significant in-distribution degradation, while non-debiasing methods sacrifice a large amount of out-of-distribution performance to achieve high in-distribution performance. We propose a novel method combining multi-teacher knowledge distillation and contrastive learning (MKDCL) to address the language prior problem in VQA. A Question Answer Selection (QAS) module selects reasonable questions for images and determines pseudo answers from the teachers' weighted predictions. A Contrastive Learning Samples Generation (CLSG) module synthesizes four types of positive and negative samples in the visual and language modalities for contrastive learning, strengthening the model's reliance on image semantics while avoiding performance degradation caused by spurious question-answer correlations. Our method is model-agnostic and achieves state-of-the-art accuracy (62.93%) on the language-prior-sensitive VQA-CP v2 dataset while maintaining accuracy (65.43%) on VQA v2.
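The abstract names two mechanisms without implementation detail: fusing several teachers' predictions into pseudo answers (QAS) and training on synthesized positive/negative pairs (CLSG). The sketch below is a minimal PyTorch illustration of generic versions of those two building blocks; the function names, tensor shapes, and the InfoNCE formulation are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_answers(teacher_logits, teacher_weights):
    """Fuse per-teacher answer distributions into pseudo answers.

    teacher_logits:  list of (batch, num_answers) tensors, one per teacher.
    teacher_weights: (num_teachers,) tensor of non-negative fusion weights.
    Returns the argmax answer index under the weighted mixture.
    """
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])  # (T, B, A)
    w = teacher_weights / teacher_weights.sum()
    mixture = (w.view(-1, 1, 1) * probs).sum(dim=0)                      # (B, A)
    return mixture.argmax(dim=-1)                                        # (B,)

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: pull each anchor toward its positive, away from its negatives.

    anchor, positive: (batch, dim) embeddings; negatives: (batch, num_neg, dim).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)              # (B, 1)
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives)              # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

In this framing, QAS would supply the pseudo answer labels for augmented image-question pairs, while CLSG would supply the anchor, positive, and negative embeddings fed to the contrastive loss.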
Pages: 11