Relation-Aware Image Captioning for Explainable Visual Question Answering

Cited by: 1
Authors
Tseng, Ching-Shan [1]
Lin, Ying-Jia [1]
Kao, Hung-Yu [1]
Affiliations
[1] National Cheng Kung University, Department of Computer Science and Information Engineering, Intelligent Knowledge Management Lab, Tainan, Taiwan
Keywords
visual question answering; image captioning; explainable VQA; cross-modality learning; multi-task learning
DOI
10.1109/TAAI57707.2022.00035
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent studies that leverage object detection models for Visual Question Answering (VQA) ignore the correlations and interactions among multiple objects. In addition, previous VQA models are black boxes, making it difficult to explain why a model returns a correct or wrong answer. To address these problems, we propose a new model structure that incorporates image captioning into the VQA task. Our model constructs a relation graph according to the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To make the predictions explainable, we introduce an image captioning module and conduct multi-task training. Meanwhile, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations according to the questions and images, and that the relation encoder and the caption-attended predictor yield improvements across different question types.
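As a rough illustration of the relation-graph construction the abstract describes, one could label every ordered region pair with a relative-position feature. This is a minimal sketch under stated assumptions: the box format, the specific feature (normalized center offset plus log size ratios), and the function names below are assumptions for illustration, not the paper's exact definition.

```python
import math

def relative_position(box_a, box_b):
    """Relative-position feature for an ordered pair of region boxes.

    Boxes are (x1, y1, x2, y2). The feature is the center offset of
    box_b from box_a, normalized by box_a's size, plus log width/height
    ratios. (Hypothetical encoding; the paper's may differ.)
    """
    xa, ya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    xb, yb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return ((xb - xa) / wa, (yb - ya) / ha,
            math.log(wb / wa), math.log(hb / ha))

def build_relation_graph(boxes):
    """Edge set over all ordered region pairs, each edge labeled with
    the relative-position feature of that pair."""
    return {(i, j): relative_position(boxes[i], boxes[j])
            for i in range(len(boxes))
            for j in range(len(boxes)) if i != j}
```

A relation encoder would then condition the attention between regions i and j on the edge feature `graph[(i, j)]`, making the fused visual features relation-aware.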
Pages: 149-154
Page count: 6
Related Papers
50 items total
  • [1] Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering
    Lin, Ying-Jia; Tseng, Ching-Shan; Kao, Hung-Yu
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2024, 40(3): 649-659
  • [2] Visual question answering with gated relation-aware auxiliary
    Shao, Xiangjun; Xiang, Zhenglong; Li, Yuanxiang
    IET IMAGE PROCESSING, 2022, 16(5): 1424-1432
  • [3] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie; Gan, Zhe; Cheng, Yu; Liu, Jingjing
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 10312-10321
  • [4] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao; Cao, Meng; Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531: 495-507
  • [5] Image captioning improved visual question answering
    Sharma, Himanshu; Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81: 34775-34796
  • [6] Relation-Aware Question Answering for Heterogeneous Knowledge Graphs
    Du, Haowei; Huang, Quzhe; Li, Chen; Zhang, Chen; Li, Yang; Zhao, Dongyan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023: 13582-13592
  • [7] Image captioning improved visual question answering
    Sharma, Himanshu; Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81(24): 34775-34796
  • [8] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun; Dong, Hongsong; Wu, Guangsheng
    IEEE ACCESS, 2025, 13: 46299-46311
  • [9] Relation-Aware Language-Graph Transformer for Question Answering
    Park, Jinyoung; Choi, Hyeong Kyu; Ko, Juyeon; Park, Hyeonjin; Kim, Ji-Hoon; Jeong, Jisu; Kim, Kyungmin; Kim, Hyunwoo J.
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37, NO 11, 2023: 13457-13464
  • [10] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao; Liu, Zihe; Bai, Ting; Yan, Chenghao; Cao, Chenyu; Wu, Bin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021: 164-172