Co-attention graph convolutional network for visual question answering

Cited by: 0
Authors
Chuan Liu
Ying-Ying Tan
Tian-Tian Xia
Jiajing Zhang
Ming Zhu
Affiliations
[1] Anhui Jianzhu University, School of Mathematics and Physics
[2] Anhui University, School of Integrated Circuits
[3] Anhui Jianzhu University, Operations Research and Data Science Laboratory
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Visual question answering; Binary relational reasoning; Spatial graph convolution; Attention mechanism
DOI
Not available
Abstract
Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual content of questions. Conventional visual attention models, designed primarily from the perspective of the attention mechanism, lack the ability to reason about relationships between visual objects and ignore the multimodal interactions between questions and images. In this work, we propose a model that combines a graph convolutional network with a co-attention network to circumvent this problem. The model employs binary relational reasoning as a graph learner module to learn a graph structure that captures the relationships between visual objects, and it learns question-specific, spatially aware image representations via spatial graph convolution. We then perform parallel co-attention learning by passing the image representations and the question word features through a deep co-attention module. Experimental results demonstrate that our model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.
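To make the three stages of the abstract concrete, below is a minimal PyTorch sketch of one plausible realization: a graph learner that scores all object pairs (binary relational reasoning), a spatial graph convolution that mixes region features along the learned adjacency while conditioning on bounding-box coordinates, and a parallel co-attention step between question words and image regions. All class names, dimensions, and fusion choices here are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearner(nn.Module):
    """Binary relational reasoning (assumed form): score every pair of
    visual objects to produce a soft adjacency matrix over image regions."""
    def __init__(self, dim):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, v):                        # v: (B, N, dim) region features
        B, N, d = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, d)   # object i, broadcast over j
        vj = v.unsqueeze(1).expand(B, N, N, d)   # object j, broadcast over i
        scores = self.pair_mlp(torch.cat([vi, vj], dim=-1)).squeeze(-1)
        return F.softmax(scores, dim=-1)         # (B, N, N), rows sum to 1

class SpatialGraphConv(nn.Module):
    """One graph-convolution step over the learned adjacency; spatial
    awareness is injected by concatenating normalized box coordinates."""
    def __init__(self, dim, box_dim=4):
        super().__init__()
        self.proj = nn.Linear(dim + box_dim, dim)

    def forward(self, v, boxes, adj):            # boxes: (B, N, 4), adj: (B, N, N)
        h = self.proj(torch.cat([v, boxes], dim=-1))  # fuse appearance + position
        return F.relu(torch.bmm(adj, h))         # aggregate neighbor features

class CoAttention(nn.Module):
    """Parallel co-attention: words attend to regions and regions attend
    to words through a shared affinity matrix."""
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, v, q):                     # v: (B, N, dim), q: (B, T, dim)
        C = torch.bmm(q, self.affinity(v).transpose(1, 2))              # (B, T, N)
        attended_v = torch.bmm(F.softmax(C, dim=-1), v)                 # per word
        attended_q = torch.bmm(F.softmax(C.transpose(1, 2), dim=-1), q) # per region
        return attended_v, attended_q

# Usage with typical VQA shapes: 36 detected regions, a 14-token question.
v = torch.randn(2, 36, 512)                      # region features (e.g., Faster R-CNN)
boxes = torch.rand(2, 36, 4)                     # normalized bounding boxes
q = torch.randn(2, 14, 512)                      # question word features
adj = GraphLearner(512)(v)
v_ctx = SpatialGraphConv(512)(v, boxes, adj)
att_v, att_q = CoAttention(512)(v_ctx, q)        # fused downstream to predict the answer
```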
Pages: 2527–2543
Page count: 16