Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

Cited by: 24
|
Authors
Guo, Yangyang [1 ]
Cheng, Zhiyong [2 ]
Nie, Liqiang [1 ]
Liu, Yibing [1 ]
Wang, Yinglong [2 ]
Kankanhalli, Mohan [3 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Jinan, Shandong, Peoples R China
[2] Qilu Univ Technol, Shandong Acad Sci, Natl Supercomp Ctr Jinan, Shandong Comp Sci Ctr, Jinan, Shandong, Peoples R China
[3] Natl Univ Singapore, Sch Comp, Singapore, Singapore
Source
PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19) | 2019
Funding
National Natural Science Foundation of China; National Research Foundation of Singapore;
Keywords
Visual Question Answering; Language Prior Problem; Evaluation Metric;
DOI
10.1145/3331184.3331186
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received considerable attention over the past few years. Although some progress has been made, several studies have pointed out that current VQA models suffer heavily from the language prior problem: they tend to answer questions based on co-occurrence patterns between question keywords (e.g., "how many") and answers (e.g., "2") rather than by understanding the image and the question. Existing methods attempt to solve this problem either by balancing the biased datasets or by forcing models to better understand the image; however, the former yields only marginal gains and the latter can even degrade performance. Another important issue is the inability to quantitatively measure the extent of the language prior effect, which severely hinders progress on related techniques. In this paper, we make contributions toward solving these problems from two perspectives. First, we design a metric to quantitatively measure the language prior effect on VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method (i.e., a score regularization module) that enhances current VQA models by alleviating the language prior problem while also boosting the backbone model's performance. The proposed score regularization module adopts a pair-wise learning strategy that pushes a VQA model to answer a question by reasoning over the image rather than by exploiting question-answer patterns observed in the biased training set. The module is versatile and can be integrated into various VQA models.
We conducted extensive experiments on two popular VQA datasets (i.e., VQA 1.0 and VQA 2.0) and integrated the score regularization module into three state-of-the-art VQA models. Experimental results show that the module not only effectively reduces the language prior effect in these models but also consistently improves their question answering accuracy.
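The abstract does not give the exact formulation of the pair-wise learning strategy, but its stated goal (making the model prefer answers grounded in the image over answers driven by question-answer patterns alone) can be sketched as a hinge-style pair-wise term. In the sketch below, the function name, arguments, and margin value are all hypothetical illustrations, not taken from the paper: the regularizer penalizes the model unless the score of the ground-truth answer computed with the true image exceeds, by a margin, its score computed with an irrelevant image.

```python
def pairwise_score_regularization(score_true_img, score_other_img, margin=0.5):
    """Hinge-style pair-wise regularizer (hypothetical sketch).

    score_true_img:  model score for the ground-truth answer when the
                     question is paired with its actual image.
    score_other_img: score for the same answer when the question is paired
                     with a randomly sampled, irrelevant image.

    The penalty is zero only when the real-image score beats the
    irrelevant-image score by at least `margin`, nudging the model to
    ground its answer in the image rather than in the question alone.
    """
    return max(0.0, margin - (score_true_img - score_other_img))

# A model that ignores the image produces identical scores -> penalized:
print(pairwise_score_regularization(0.9, 0.9))  # 0.5
# A model that relies on the image separates the two scores -> no penalty:
print(pairwise_score_regularization(0.9, 0.2))  # 0.0
```

Under this reading, the term is added to the backbone model's answering loss, so any VQA model that produces per-answer scores could be regularized this way without architectural changes.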
Pages: 75 - 84
Page count: 10
Related Papers
50 records in total
  • [1] Handling language prior and compositional reasoning issues in Visual Question Answering system
    Chowdhury, Souvik
    Soni, Badal
    NEUROCOMPUTING, 2025, 635
  • [2] PRIOR VISUAL RELATIONSHIP REASONING FOR VISUAL QUESTION ANSWERING
    Yang, Zhuoqian
    Qin, Zengchang
    Yu, Jing
    Wan, Tao
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1411 - 1415
  • [3] LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Lu, Hanqing
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3307 - 3311
  • [4] Multiview Language Bias Reduction for Visual Question Answering
    Li, Pengju
    Tan, Zhiyi
    Bao, Bing-Kun
    IEEE MULTIMEDIA, 2023, 30 (01) : 91 - 99
  • [5] An Empirical Study on the Language Modal in Visual Question Answering
    Peng, Daowan
    Wei, Wei
    Mao, Xian-Ling
    Fu, Yuanyuan
    Chen, Dangyang
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4109 - 4117
  • [6] LANGUAGE TRANSFORMERS FOR REMOTE SENSING VISUAL QUESTION ANSWERING
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [7] LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering
    Liang, Zujie
    Hu, Haifeng
    Zhu, Jiaying
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1955 - 1959
  • [8] Ensemble approach for natural language question answering problem
    Aniol, Anna
    Pietron, Marcin
    Duda, Jerzy
    2019 SEVENTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING WORKSHOPS (CANDARW 2019), 2019, : 180 - 183
  • [9] Multiple Meta-model Quantifying for Medical Visual Question Answering
    Do, Thong
    Nguyen, Binh X.
    Tjiputra, Erman
    Tran, Minh
    Tran, Quang D.
    Anh Nguyen
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT V, 2021, 12905 : 64 - 74
  • [10] Visual Question Answering
    Nada, Ahmed
    Chen, Min
    2024 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS, ICNC, 2024, : 6 - 10