SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

Cited by: 2
Authors
Cao, Feiqi [1 ]
Luo, Siwen [1 ]
Nunez, Felipe [1 ]
Wen, Zean [1 ]
Poon, Josiah [1 ]
Han, Soyeon Caren [1 ,2 ]
Affiliations
[1] Univ Sydney, Fac Engn, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Western Australia, Sch Phys Maths & Comp, Dept Comp Sci & Software Engn, Crawley, WA 6009, Australia
Keywords
artificial neural networks; computational and artificial intelligence; natural language processing; Visual Question Answering; scene graphs
DOI
10.3390/robotics12040114
Chinese Library Classification
TP24 [Robotics]
Discipline codes
080202; 1405
Abstract
Visual Question Answering (VQA) models fail catastrophically on questions that require reading text in images. TextVQA addresses this gap: it aims to answer questions by understanding the scene text in an image-question context, such as the brand name of a product or the time shown on a clock. Most TextVQA approaches focus on detecting objects and scene text, which are then integrated with the question words through a simple transformer encoder. Such approaches rely on weights shared across modalities during training, but they fail to capture the semantic relations between an image and a question. In this paper, we propose a Scene Graph-based Co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens and question words. This is achieved with a TextVQA-oriented scene graph that uncovers the underlying semantics of an image. We create a guided-attention module that captures the intra-modal interplay within language and vision and uses it to guide the inter-modal interactions. To teach the relations between the two modalities explicitly, we propose and integrate two attention modules: a scene-graph-based semantic relation-aware attention and a positional relation-aware attention. Extensive experiments on two widely used benchmark datasets, Text-VQA and ST-VQA, show that SceneGATE outperforms existing methods owing to the scene graph and its attention modules.
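The abstract describes injecting scene-graph relations into the attention over objects, OCR tokens and question words. The sketch below is a minimal PyTorch illustration of one common way to realize relation-aware attention, biasing the attention logits with a learned embedding of each pair's relation type. The module name, the relation-id input and the scalar-bias formulation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Single-head attention whose logits are biased by scene-graph relations.
    Hypothetical sketch: names and the bias formulation are assumptions."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar bias per relation type (e.g. "left-of",
        # "part-of", OCR-token-on-object), added to the attention logits.
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, dim) fused object / OCR / question features
        # rel_ids: (batch, n_nodes, n_nodes) relation id for each node pair
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)
        return torch.matmul(F.softmax(logits, dim=-1), v)

# Example: 2 images, 16 graph nodes, 256-dim features, 12 relation types
attn = RelationAwareAttention(dim=256, num_relations=12)
nodes = torch.randn(2, 16, 256)
rels = torch.randint(0, 12, (2, 16, 16))
out = attn(nodes, rels)  # (2, 16, 256)

A positional relation-aware variant could be built the same way, deriving rel_ids from quantized spatial layouts of bounding boxes instead of semantic scene-graph edges.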
Pages: 18
Related papers
50 entries in total
  • [1] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543
  • [2] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [3] A medical visual question answering approach based on co-attention networks
    Cui, W.
    Shi, W.
    Shao, H.
    Shengwu Yixue Gongchengxue Zazhi/Journal of Biomedical Engineering, 2024, 41 (03): 560 - 568
  • [4] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
  • [5] An Effective Dense Co-Attention Networks for Visual Question Answering
    He, Shirong
    Han, Dezhi
    SENSORS, 2020, 20 (17) : 1 - 15
  • [6] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Yan, Feng
    Silamu, Wushouer
    Chai, Yachuang
    Li, Yanbing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 7085 - 7096
  • [7] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    Journal of Engineering Science and Technology Review, 2021, 14 (06) : 116 - 123