HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

Cited by: 0
Authors
Hao, Dongze [1 ,2 ]
Wang, Qunbo [1 ]
Zhu, Xinxin [1 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; hierarchical counterfactual contrastive learning; robust VQA
DOI
10.1145/3673902
CLC number
TP [Automation & Computer Technology]
Discipline code
0812
Abstract
Although most state-of-the-art Visual Question Answering (VQA) models achieve impressive performance, they often exploit dataset biases to answer questions. Recently, some studies have synthesized counterfactual training samples to help models mitigate these biases. However, such synthetic samples require extra annotations and often contain noise. Moreover, these methods simply add the synthetic samples to the training data and train the model with a cross-entropy loss, which does not make the best use of the synthetic samples for bias mitigation. In this article, to mitigate biases in VQA more effectively, we propose a Hierarchical Counterfactual Contrastive Learning (HCCL) method. First, to avoid introducing noise and extra annotations, our method automatically masks the unimportant features of the original question-image pair to obtain positive samples, and creates mismatched question-image pairs as negative samples. Then, it applies feature-level and answer-level contrastive learning to pull the original sample toward its positive samples in the feature space, while pushing it away from negative samples in both the feature and answer spaces. In this way, the VQA model learns robust multimodal features and attends to both visual and linguistic information when producing an answer. HCCL can be adopted with different baselines, and experimental results on the VQA v2, VQA-CP, and GQA-OOD datasets show that it effectively mitigates biases in VQA and improves the robustness of the VQA model.
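To make the two contrastive levels described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch: positives are built by masking weakly attended visual features, negatives by mismatching question-image pairs within a batch, combined with an InfoNCE-style feature-level term and an answer-level term that penalizes a mismatched pair for predicting the ground-truth answer. All module names, shapes, and loss weights are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of HCCL's two contrastive levels (not the authors' code).
import torch
import torch.nn.functional as F


class TinyVQA(torch.nn.Module):
    """Stand-in for any VQA baseline: returns a fused feature and answer logits."""
    def __init__(self, v_dim=512, q_dim=512, hid=512, n_answers=1000):
        super().__init__()
        self.fuse = torch.nn.Linear(v_dim + q_dim, hid)
        self.cls = torch.nn.Linear(hid, n_answers)

    def forward(self, v, q):
        z = torch.relu(self.fuse(torch.cat([v.mean(dim=1), q], dim=-1)))
        return z, self.cls(z)


def mask_unimportant(v, attn, keep_ratio=0.7):
    """Positive sample: keep only the top-k most attended region features."""
    k = max(1, int(v.size(1) * keep_ratio))
    idx = attn.topk(k, dim=1).indices
    mask = torch.zeros_like(attn).scatter_(1, idx, 1.0)
    return v * mask.unsqueeze(-1)


def info_nce(anchor, positive, negatives, tau=0.1):
    """Feature-level contrast: anchor close to its positive, far from negatives."""
    anchor, positive, negatives = (
        F.normalize(t, dim=-1) for t in (anchor, positive, negatives))
    pos = (anchor * positive).sum(-1, keepdim=True) / tau   # (B, 1)
    neg = anchor @ negatives.t() / tau                      # (B, B)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, target)


def hccl_loss(model, v, attn, q, answers, alpha=1.0, beta=1.0):
    z, logits = model(v, q)                           # original pair
    z_pos, _ = model(mask_unimportant(v, attn), q)    # positive: masked pair
    # Negatives: shuffle images within the batch (a permutation may leave a
    # few pairs matched; ignored in this sketch).
    perm = torch.randperm(v.size(0), device=v.device)
    z_neg, logits_neg = model(v[perm], q)
    # Answer-level contrast: a mismatched pair should not predict the answer.
    p_neg = F.softmax(logits_neg, dim=-1).gather(1, answers.unsqueeze(1)).squeeze(1)
    answer_level = -torch.log(1.0 - p_neg + 1e-6).mean()
    return (F.cross_entropy(logits, answers)
            + alpha * info_nce(z, z_pos, z_neg)
            + beta * answer_level)


# Usage on random tensors (8 samples, 36 visual regions, 1000 answers).
model = TinyVQA()
v, attn = torch.randn(8, 36, 512), torch.rand(8, 36)
q, answers = torch.randn(8, 512), torch.randint(0, 1000, (8,))
hccl_loss(model, v, attn, q, answers).backward()
```

In this reading, pulling the anchor toward its masked positive encourages invariance to uninformative regions, while the answer-level term directly discourages the language-prior shortcut of answering correctly without the matching image.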
Pages: 21