SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

Cited by: 2
Authors
Cao, Feiqi [1 ]
Luo, Siwen [1 ]
Nunez, Felipe [1 ]
Wen, Zean [1 ]
Poon, Josiah [1 ]
Han, Soyeon Caren [1 ,2 ]
Affiliations
[1] Univ Sydney, Fac Engn, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Western Australia, Sch Phys Maths & Comp, Dept Comp Sci & Software Engn, Crawley, WA 6009, Australia
Keywords
artificial neural networks; computational and artificial intelligence; natural language processing; Visual Question Answering; scene graphs
DOI
10.3390/robotics12040114
Chinese Library Classification
TP24 [Robotics]
Discipline codes
080202; 1405
Abstract
Visual Question Answering (VQA) models fail catastrophically on questions that require reading text in images. TextVQA addresses this gap: it aims to answer questions by understanding the scene text in an image-question context, such as the brand name of a product or the time shown on a clock. Most TextVQA approaches focus on detecting objects and scene text, which are then integrated with the question words through a simple transformer encoder. Such approaches rely on weights shared across modalities during training, but they fail to capture the semantic relations between an image and a question. In this paper, we propose a Scene Graph-based Co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens and question words. This is achieved with a TextVQA-oriented scene graph that uncovers the underlying semantics of an image. We create a guided-attention module that captures the intra-modal interplay within language and vision and uses it to guide the inter-modal interactions. To teach the relations between the two modalities explicitly, we propose and integrate two attention modules: a scene-graph-based semantic relation-aware attention and a positional relation-aware attention. Extensive experiments on two widely used benchmark datasets, Text-VQA and ST-VQA, show that SceneGATE outperforms existing methods owing to the scene graph and its attention modules.
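The abstract describes injecting scene-graph relations into the attention over objects, OCR tokens and question words. The sketch below is a minimal PyTorch illustration of one common way to realize relation-aware attention, biasing the attention logits with a learned embedding of each pair's relation type. The module name, the relation-id input and the scalar-bias formulation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Single-head attention whose logits are biased by scene-graph relations.
    Hypothetical sketch: names and the bias formulation are assumptions."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar bias per relation type (e.g. "left-of",
        # "part-of", OCR-token-on-object), added to the attention logits.
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, dim) fused object / OCR / question features
        # rel_ids: (batch, n_nodes, n_nodes) relation id for each node pair
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)
        return torch.matmul(F.softmax(logits, dim=-1), v)

# Example: 2 images, 16 graph nodes, 256-dim features, 12 relation types
attn = RelationAwareAttention(dim=256, num_relations=12)
nodes = torch.randn(2, 16, 256)
rels = torch.randint(0, 12, (2, 16, 16))
out = attn(nodes, rels)  # (2, 16, 256)

A positional relation-aware variant could be built the same way, deriving rel_ids from quantized spatial layouts of bounding boxes instead of semantic scene-graph edges.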
Pages: 18
Related papers
50 entries in total
  • [1] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543
  • [2] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [3] A medical visual question answering approach based on co-attention networks
    Cui, W.
    Shi, W.
    Shao, H.
    Shengwu Yixue Gongchengxue Zazhi/Journal of Biomedical Engineering, 2024, 41 (03): 560 - 568
  • [4] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
  • [5] An Effective Dense Co-Attention Networks for Visual Question Answering
    He, Shirong
    Han, Dezhi
    SENSORS, 2020, 20 (17) : 1 - 15
  • [6] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Yan, Feng
    Silamu, Wushouer
    Chai, Yachuang
    Li, Yanbing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 7085 - 7096
  • [7] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    Journal of Engineering Science and Technology Review, 2021, 14 (06) : 116 - 123