Relation-Aware Image Captioning for Explainable Visual Question Answering

Cited by: 1
Authors
Tseng, Ching-Shan [1]
Lin, Ying-Jia [1]
Kao, Hung-Yu [1]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Intelligent Knowledge Management Lab, Tainan, Taiwan
Keywords
visual question answering; image captioning; explainable VQA; cross-modality learning; multi-task learning
DOI
10.1109/TAAI57707.2022.00035
CLC (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent studies that leverage object detection models for Visual Question Answering (VQA) ignore the correlations and interactions between multiple objects. In addition, previous VQA models are black boxes to humans, which makes it difficult to explain why a model returns correct or wrong answers. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph according to the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and conduct multi-task training. Meanwhile, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model can generate meaningful answers and explanations according to the questions and images. In addition, the relation encoder and the caption-attended predictor lead to improvements across different question types.
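The abstract outlines a concrete pipeline: a spatial relation graph over detected regions, a relation encoder that produces relation-aware region features, and a caption-attended answer predictor trained with a multi-task objective. Below is a minimal PyTorch sketch of the first two steps. The 8-way directional edge labeling, the names `build_relation_graph` and `RelationEncoder`, and the scalar attention-bias design are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

def build_relation_graph(boxes: torch.Tensor) -> torch.Tensor:
    """Label every ordered region pair by its relative position.

    boxes: (N, 4) tensor of (x1, y1, x2, y2) coordinates.
    Returns an (N, N) integer matrix of edge labels. The 8-way
    directional binning is an assumption for illustration.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2           # box centers, x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2           # box centers, y
    dx = cx.unsqueeze(0) - cx.unsqueeze(1)         # (N, N) horizontal offsets
    dy = cy.unsqueeze(0) - cy.unsqueeze(1)         # (N, N) vertical offsets
    angle = torch.atan2(dy, dx)                    # direction of region j w.r.t. region i
    # Quantize angles into 8 directional bins -> labels 1..8; label 0 marks self-loops.
    labels = ((angle + math.pi) / (math.pi / 4)).long().clamp(0, 7) + 1
    labels.fill_diagonal_(0)
    return labels

class RelationEncoder(nn.Module):
    """Self-attention over regions with a learned bias per edge label,
    so the attention weights become relation-aware (a common
    graph-attention variant, assumed here for illustration)."""

    def __init__(self, dim: int = 512, num_labels: int = 9):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_labels, 1)  # one scalar bias per relation type
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, rel_labels: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) region features; rel_labels: (N, N) edge labels
        attn = (self.q(feats) @ self.k(feats).T) * self.scale
        attn = attn + self.rel_bias(rel_labels).squeeze(-1)  # inject spatial relations
        return torch.softmax(attn, dim=-1) @ self.v(feats)   # relation-aware features

# Toy usage with random region features and valid (x1, y1, x2, y2) boxes.
feats = torch.randn(36, 512)                       # e.g. 36 detected-region features
xy = torch.rand(36, 2)
boxes = torch.cat([xy, xy + torch.rand(36, 2) * 0.3], dim=-1)
out = RelationEncoder()(feats, build_relation_graph(boxes))
```

In the full model described by the abstract, the encoder output would feed both a caption decoder and the answer classifier, trained jointly with a weighted sum of the caption and answer losses, with the generated captions attended by the predictor; those weighting and attention details are not specified here and the sketch above makes no claim about them.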
Pages: 149-154
Page count: 6
Related Papers
50 records in total
  • [41] Position-aware image captioning with spatial relation
    Duan, Yiqun
    Wang, Zhen
    Wang, Jingya
    Wang, Yu-Kai
    Lin, Chin-Teng
NEUROCOMPUTING, 2022, 497 : 28 - 38
  • [43] Question-controlled Text-aware Image Captioning
    Hu, Anwen
    Chen, Shizhe
    Jin, Qin
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3097 - 3105
  • [44] LEARNING REPRESENTATIONS FROM EXPLAINABLE AND CONNECTIONIST APPROACHES FOR VISUAL QUESTION ANSWERING
    Mishra, Aakansha
    Soumitri, Miriyala Srinivas
    Rajendiran, Vikram N.
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6420 - 6424
  • [46] Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering
    Yan, Feng
    Li, Zhe
    Silamu, Wushour
    Li, Yanbing
    MACHINE LEARNING, 2024, 113 (06) : 3789 - 3805
  • [47] Question-aware prediction with candidate answer recommendation for visual question answering
    Kim, B.
    Kim, J.
    ELECTRONICS LETTERS, 2017, 53 (18) : 1244 - 1245
  • [48] CONTEXT RELATION FUSION MODEL FOR VISUAL QUESTION ANSWERING
    Zhang, Haotian
    Wu, Wei
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2112 - 2116
  • [49] Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering
    Changpinyo, Soravit
    Pang, Bo
    Sharma, Piyush
    Soricut, Radu
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1468 - 1474
  • [50] ConceptBert: Concept-Aware Representation for Visual Question Answering
    Garderes, Francois
    Ziaeefard, Maryam
    Abeloos, Baptiste
    Lecue, Freddy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 489 - 498