Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering

Cited by: 0
Authors
Lin, Ying-Jia [1 ]
Tseng, Ching-Shan [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
visual question answering; explainable VQA; multi-task learning; graph attention networks; vision-language model
DOI
10.6688/JISE.202405_40(3).0014
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Recent studies that use object detection as a preliminary step for Visual Question Answering (VQA) ignore the relationships between different objects in an image with respect to the textual question. In addition, previous VQA models work like black-box functions, making it difficult to explain why a model gives a particular answer for the corresponding inputs. To address these issues, we propose a new model structure that strengthens the representations of different objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then create relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps our model produce reasonable explanations. Simultaneously, the generated explanations are incorporated with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method performs better on the VQA task than several baselines. In addition, incorporating the explanation generator allows the model to provide reasonable explanations along with its predicted answers.
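The abstract is concrete enough to sketch its two mechanisms: a relation encoder based on graph attention, in which the pairwise attention between regions is conditioned on their discrete spatial relation, and a Hybrid-Attention step that fuses the generated explanation with the relation-aware visual features. Below is a minimal PyTorch sketch of one plausible reading; the class names, the scalar relation-bias design, and the mean-pooled cross-attention fusion are assumptions of this sketch, not the paper's actual implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract, assuming
# PyTorch and a discretized vocabulary of pairwise spatial relations.
# All names and design details are hypothetical readings of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Single-head graph attention over detected regions; the attention
    logit for each region pair is biased by a learned scalar embedding of
    their discrete spatial relation (e.g. "left-of", "overlaps")."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # one bias per relation type

    def forward(self, regions: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # regions: (N, dim) region features; rel_ids: (N, N) relation labels
        q, k, v = self.q(regions), self.k(regions), self.v(regions)
        logits = q @ k.t() / regions.size(-1) ** 0.5          # (N, N)
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)  # add relation bias
        attn = F.softmax(logits, dim=-1)
        return attn @ v                                       # relation-aware features

class HybridAttentionFusion(nn.Module):
    """Cross-attention from generated-explanation token embeddings to the
    relation-aware visual features, pooled into one joint representation."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, expl_tokens: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # expl_tokens: (B, T, dim); visual: (B, N, dim)
        fused, _ = self.cross(expl_tokens, visual, visual)
        return fused.mean(dim=1)                              # (B, dim) joint feature

# Example shapes: 36 detected regions, 512-dim features, 11 relation types.
encoder = RelationAwareAttention(512, num_relations=11)
fusion = HybridAttentionFusion(512)
regions = torch.randn(36, 512)
rel_ids = torch.randint(0, 11, (36, 36))
visual = encoder(regions, rel_ids).unsqueeze(0)               # (1, 36, 512)
expl = torch.randn(1, 12, 512)                                # 12 explanation tokens
joint = fusion(expl, visual)                                  # (1, 512) for the answer head
```

Under the multi-task setup the abstract describes, such a joint feature would feed an answer classifier while the explanation generator is trained jointly, e.g. via a weighted sum of the two losses; the loss weighting and the exact relation vocabulary are details the abstract does not specify.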
Pages: 649-659
Number of pages: 11
Related Papers
50 records in total
  • [1] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154
  • [2] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [3] Visual question answering with gated relation-aware auxiliary
    Shao, Xiangjun
    Xiang, Zhenglong
    Li, Yuanxiang
    IET IMAGE PROCESSING, 2022, 16 (05) : 1424 - 1432
  • [4] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao
    Liu, Zihe
    Bai, Ting
    Yan, Chenghao
    Cao, Chenyu
    Wu, Bin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 164 - 172
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [7] A BERT-based Approach with Relation-aware Attention for Knowledge Base Question Answering
    Luo, Da
    Su, Jindian
    Yu, Shanshan
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [8] Relation-Aware Question Answering for Heterogeneous Knowledge Graphs
    Du, Haowei
    Huang, Quzhe
    Li, Chen
    Zhang, Chen
    Li, Yang
    Zhao, Dongyan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13582 - 13592
  • [9] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311