Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering

Cited by: 0
Authors
Lin, Ying-Jia [1 ]
Tseng, Ching-Shan [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
visual question answering; explainable VQA; multi-task learning; graph attention networks; vision-language model
DOI
10.6688/JISE.202405_40(3).0014
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Recent studies that use object detection as a preliminary step for Visual Question Answering (VQA) ignore the relationships between different objects in an image with respect to the textual question. In addition, previous VQA models work like black-box functions, making it difficult to explain why a model gives a particular answer for the corresponding inputs. To address these issues, we propose a new model structure that strengthens the representations of different objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then create relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps our model produce reasonable explanations. Simultaneously, the generated explanations are incorporated with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method performs better on the VQA task than several baselines. In addition, incorporating the explanation generator allows the model to provide reasonable explanations along with its predicted answers.
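The abstract is concrete enough to sketch its two mechanisms: a relation encoder based on graph attention, in which the pairwise attention between regions is conditioned on their discrete spatial relation, and a Hybrid-Attention step that fuses the generated explanation with the relation-aware visual features. Below is a minimal PyTorch sketch of one plausible reading; the class names, the scalar relation-bias design, and the mean-pooled cross-attention fusion are assumptions of this sketch, not the paper's actual implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract, assuming
# PyTorch and a discretized vocabulary of pairwise spatial relations.
# All names and design details are hypothetical readings of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Single-head graph attention over detected regions; the attention
    logit for each region pair is biased by a learned scalar embedding of
    their discrete spatial relation (e.g. "left-of", "overlaps")."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # one bias per relation type

    def forward(self, regions: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # regions: (N, dim) region features; rel_ids: (N, N) relation labels
        q, k, v = self.q(regions), self.k(regions), self.v(regions)
        logits = q @ k.t() / regions.size(-1) ** 0.5          # (N, N)
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)  # add relation bias
        attn = F.softmax(logits, dim=-1)
        return attn @ v                                       # relation-aware features

class HybridAttentionFusion(nn.Module):
    """Cross-attention from generated-explanation token embeddings to the
    relation-aware visual features, pooled into one joint representation."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, expl_tokens: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # expl_tokens: (B, T, dim); visual: (B, N, dim)
        fused, _ = self.cross(expl_tokens, visual, visual)
        return fused.mean(dim=1)                              # (B, dim) joint feature

# Example shapes: 36 detected regions, 512-dim features, 11 relation types.
encoder = RelationAwareAttention(512, num_relations=11)
fusion = HybridAttentionFusion(512)
regions = torch.randn(36, 512)
rel_ids = torch.randint(0, 11, (36, 36))
visual = encoder(regions, rel_ids).unsqueeze(0)               # (1, 36, 512)
expl = torch.randn(1, 12, 512)                                # 12 explanation tokens
joint = fusion(expl, visual)                                  # (1, 512) for the answer head
```

Under the multi-task setup the abstract describes, such a joint feature would feed an answer classifier while the explanation generator is trained jointly, e.g. via a weighted sum of the two losses; the loss weighting and the exact relation vocabulary are details the abstract does not specify.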
Pages: 649-659
Number of pages: 11
Related Papers
50 records in total
  • [1] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154
  • [2] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [3] Visual question answering with gated relation-aware auxiliary
    Shao, Xiangjun
    Xiang, Zhenglong
    Li, Yuanxiang
    IET IMAGE PROCESSING, 2022, 16 (05) : 1424 - 1432
  • [4] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao
    Liu, Zihe
    Bai, Ting
    Yan, Chenghao
    Cao, Chenyu
    Wu, Bin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 164 - 172
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [7] A BERT-based Approach with Relation-aware Attention for Knowledge Base Question Answering
    Luo, Da
    Su, Jindian
    Yu, Shanshan
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [8] Relation-Aware Question Answering for Heterogeneous Knowledge Graphs
    Du, Haowei
    Huang, Quzhe
    Li, Chen
    Zhang, Chen
    Li, Yang
    Zhao, Dongyan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13582 - 13592
  • [9] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311