Relation-Aware Image Captioning for Explainable Visual Question Answering

Cited: 1
Authors
Tseng, Ching-Shan [1 ]
Lin, Ying-Jia [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Intelligent Knowledge Management Lab, Tainan, Taiwan
Keywords
visual question answering; image captioning; explainable VQA; cross-modality learning; multi-task learning
DOI
10.1109/TAAI57707.2022.00035
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent studies leveraging object detection models for Visual Question Answering (VQA) ignore the correlations and interactions between multiple objects. In addition, previous VQA models are black boxes to humans, making it difficult to explain why a model returns a correct or wrong answer. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph from the relative positions of region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and train the model in a multi-task fashion. At the same time, the generated captions are injected into the answer predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations conditioned on the questions and images. Moreover, the relation encoder and the caption-attended predictor lead to improvements across different question types.
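The record contains no code, so the following is only a minimal PyTorch sketch of the pipeline the abstract describes: a relation encoder whose attention over region features is biased by a spatial relation graph, a caption decoder trained jointly as a second task, and an answer predictor that attends over both the visual features and the caption states. All names (RelationEncoder, CaptionAttendedVQA, relation_bias, dimensions, vocabulary sizes) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the relation-aware, caption-attended VQA model.
# Not the authors' code; shapes and module choices are assumptions.
import torch
import torch.nn as nn


class RelationEncoder(nn.Module):
    """Self-attention over region features, biased by a relation graph."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, regions, relation_bias):
        # relation_bias: additive attention bias of shape
        # (batch * heads, n_regions, n_regions), built from the relative
        # positions of region pairs (the "relation graph").
        h, _ = self.attn(regions, regions, regions, attn_mask=relation_bias)
        regions = self.norm1(regions + h)
        return self.norm2(regions + self.ffn(regions))


class CaptionAttendedVQA(nn.Module):
    """Multi-task heads: answer classification plus caption generation."""

    def __init__(self, dim=512, vocab=10000, n_answers=3129):
        super().__init__()
        self.rel_enc = RelationEncoder(dim)
        self.word_emb = nn.Embedding(vocab, dim)
        self.cap_dec = nn.LSTM(dim, dim, batch_first=True)
        self.cap_head = nn.Linear(dim, vocab)      # captioning (explanation) loss
        self.q_enc = nn.LSTM(dim, dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.ans_head = nn.Linear(dim, n_answers)  # VQA loss

    def forward(self, regions, relation_bias, question_ids, caption_ids):
        v = self.rel_enc(regions, relation_bias)             # relation-aware visual features
        cap_h, _ = self.cap_dec(self.word_emb(caption_ids))  # caption decoder states
        cap_logits = self.cap_head(cap_h)
        _, (q, _) = self.q_enc(self.word_emb(question_ids))  # question summary, (1, B, dim)
        # Caption-attended prediction: the question attends over the visual
        # features concatenated with the caption token states.
        memory = torch.cat([v, cap_h], dim=1)
        ctx, _ = self.cross_attn(q.transpose(0, 1), memory, memory)
        ans_logits = self.ans_head(ctx.squeeze(1))
        return ans_logits, cap_logits
```

Under this reading, multi-task training would simply sum a cross-entropy (or soft-score BCE) loss on ans_logits with a token-level cross-entropy loss on cap_logits, so the captioning objective regularizes the shared visual features while the generated caption doubles as the explanation.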
Pages: 149-154
Number of pages: 6
Related papers (50 in total)
  • [31] Al Mehmadi, Shima M.; Bazi, Yakoub; Al Rahhal, Mohamad M.; Zuair, Mansour. Learning to enhance areal video captioning with visual question answering. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2024, 45(18): 6395-6407.
  • [32] Yuan, Zhenghang; Mou, Lichao; Zhu, Xiao Xiang. Change-Aware Visual Question Answering. 2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022: 227-230.
  • [33] Ruwa, Nelson; Mao, Qirong; Wang, Liangjun; Gou, Jianping; Dong, Ming. Mood-aware visual question answering. NEUROCOMPUTING, 2019, 330: 305-316.
  • [34] Lu, Zhicong; Jin, Li; Chen, Ziwei; Tian, Changyuan; Sun, Xian; Li, Xiaoyu; Zhang, Yi; Li, Qi; Xu, Guangluan. Relation-Aware Multi-Pass Comparison Deconfounded Network for Change Captioning. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34(12): 13349-13363.
  • [35] Wang, Kaixuan; Li, Shasha; Tang, Jintao; Long, Kehan; Miao, Yongzhu; Chen, Fangda; Wang, Ting. Evaluating the Fidelity of Image Captioning via Weighted Boolean Question Answering. NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361: 356-368.
  • [36] Shah, Sanket; Mishra, Anand; Yadati, Naganand; Talukdar, Partha Pratim. KVQA: Knowledge-Aware Visual Question Answering. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019: 8876-8884.
  • [37] Zhang, Anda; Tao, Wei; Li, Ziyan; Wang, Haofen; Zhang, Wenqiang. Type-Aware Medical Visual Question Answering. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 4838-4842.
  • [38] Tito, Ruben; Nguyen, Khanh; Tobaben, Marlon; Kerkouche, Raouf; Souibgui, Mohamed Ali; Jung, Kangsoo; Jalko, Joonas; DAndecy, Vincent Poulain; Joseph, Aurelie; Kang, Lei; Valveny, Ernest; Honkela, Antti; Fritz, Mario; Karatzas, Dimosthenis. Privacy-Aware Document Visual Question Answering. DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT VI, 2024, 14809: 199-218.
  • [39] Zeng, Hongwei; Wei, Bifan; Liu, Jun. RTRL: Relation-aware Transformer with Reinforcement Learning for Deep Question Generation. KNOWLEDGE-BASED SYSTEMS, 2024, 300.
  • [40] Liu, Yongfei; Wan, Bo; Ma, Lin; He, Xuming. Relation-aware Instance Refinement for Weakly Supervised Visual Grounding. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 5608-5617.