Object-difference drived graph convolutional networks for visual question answering

Cited by: 23
Authors
Zhu, Xi [1,2]
Mao, Zhendong [3]
Chen, Zhineng [4]
Li, Yangyang [5]
Wang, Zhaohui [1,2]
Wang, Bin [6]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Univ Sci & Technol China, Hefei, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[5] China Acad Elect & Informat Technol, Beijing, Peoples R China
[6] Xiaomi Inc, Xiaomi AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Graph convolutional networks; Object-difference
DOI
10.1007/s11042-020-08790-0
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Visual Question Answering (VQA), an important task for evaluating the cross-modal understanding capability of an Artificial Intelligence model, has been a hot research topic in both the computer vision and natural language processing communities. Recently, graph-based models have received growing interest in VQA for their potential to model the relationships between objects and for their strong interpretability. Nonetheless, these solutions mainly take the similarity between objects as their semantic relationship, while largely ignoring the critical point that the difference between objects can provide more information for establishing the relationships between nodes in the graph. To exploit this, we propose an object-difference based graph learner, which learns question-adaptive semantic relations by calculating inter-object differences under the guidance of the question. With the learned relationships, the input image can be represented as an object graph that encodes the structural dependencies between objects. In addition, for convenience, existing graph-based models use object boxes pre-extracted by an object detection model as node features, but these boxes suffer from redundancy. To reduce the redundant objects, we introduce a soft-attention mechanism that amplifies the question-related objects. Moreover, we incorporate our object-difference based graph learner into soft-attention based Graph Convolutional Networks to capture question-specific objects and their interactions for answer prediction. Our experimental results on the VQA 2.0 dataset demonstrate that our model performs significantly better than the baseline methods.
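The abstract names three components: a question-guided graph learner built on inter-object differences, a soft attention that amplifies question-related objects, and a graph convolution over the resulting object graph. The NumPy sketch below shows one way these pieces could fit together. It is not the authors' implementation: the ReLU-projected pairwise differences, the row-wise softmax adjacency, the tanh attention, the parameter names (`W_d`, `W_q`, `W_v`, `W_a`, `W_g`) and the choice to learn the graph from raw features while convolving attended features are all assumptions made for illustration, with random weights standing in for trained ones.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def difference_graph(V, q, W_d, W_q):
    """Hypothetical question-guided graph learner based on object differences.

    V: (K, d) object features; q: (m,) question embedding.
    Returns a (K, K) adjacency matrix whose rows sum to 1.
    """
    diff = V[:, None, :] - V[None, :, :]      # (K, K, d): v_i - v_j for every pair
    rel = np.maximum(diff @ W_d, 0.0)         # (K, K, h): ReLU-projected differences
    guide = q @ W_q                           # (h,): question guidance vector
    scores = rel @ guide                      # (K, K): question-weighted relation scores
    return softmax(scores, axis=1)            # normalize each node's outgoing edges

def question_attention(V, q, W_v, W_a):
    """Hypothetical soft attention over the K objects (weights sum to 1)."""
    logits = np.tanh(V @ W_v) @ (q @ W_a)     # (K,): relevance of each object to q
    return softmax(logits, axis=0)

def gcn_layer(A, X, W_g):
    """One graph convolution: aggregate neighbours with A, project, then ReLU."""
    return np.maximum(A @ X @ W_g, 0.0)

def forward(V, q, params):
    alpha = question_attention(V, q, params["W_v"], params["W_a"])
    V_att = alpha[:, None] * V                # scale each object by its attention weight
    A = difference_graph(V, q, params["W_d"], params["W_q"])  # graph from raw features
    H = gcn_layer(A, V_att, params["W_g"])    # (K, d_out) question-aware node embeddings
    return A, alpha, H

# Usage with random (untrained) weights and toy sizes.
rng = np.random.default_rng(0)
K, d, m, h, d_out = 5, 8, 6, 4, 3
params = {
    "W_v": rng.normal(size=(d, h)), "W_a": rng.normal(size=(m, h)),
    "W_d": rng.normal(size=(d, h)), "W_q": rng.normal(size=(m, h)),
    "W_g": rng.normal(size=(d, d_out)),
}
V = rng.normal(size=(K, d))   # stand-in for detector box features
q = rng.normal(size=m)        # stand-in for a question encoding
A, alpha, H = forward(V, q, params)
print(A.shape, alpha.shape, H.shape)  # (5, 5) (5,) (5, 3)
```

Because `v_i - v_j` is zero on the diagonal, self-edges start with a score of 0 in this sketch; a real model might add explicit self-loops or a learned bias, and would pool `H` into an answer classifier.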
Pages: 16247-16265
Number of pages: 19
Related papers (50 in total)
  • [1] Object-difference drived graph convolutional networks for visual question answering
    Zhu, Xi
    Mao, Zhendong
    Chen, Zhineng
    Li, Yangyang
    Wang, Zhaohui
    Wang, Bin
    Multimedia Tools and Applications, 2021, 80: 16247-16265
  • [2] Object-Difference Attention: A Simple Relational Attention for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    Proceedings of the 2018 ACM Multimedia Conference (MM'18), 2018: 519-527
  • [3] An analysis of graph convolutional networks and recent datasets for visual question answering
    Yusuf, Abdulganiyu Abdu
    Feng Chong
    Mao Xianling
    Artificial Intelligence Review, 2022, 55 (08): 6277-6300
  • [6] Evaluation of graph convolutional networks performance for visual question answering on reasoning datasets
    Yusuf, Abdulganiyu Abdu
    Feng Chong
    Mao Xianling
    Multimedia Tools and Applications, 2022, 81 (28): 40361-40370
  • [7] Bilinear Graph Networks for Visual Question Answering
    Guo, Dalu
    Xu, Chang
    Tao, Dacheng
    IEEE Transactions on Neural Networks and Learning Systems, 2023, 34 (02): 1023-1034
  • [8] Question Answering by Reasoning Across Documents with Graph Convolutional Networks
    De Cao, Nicola
    Aziz, Wilker
    Titov, Ivan
    2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Vol. 1, 2019: 2306-2317
  • [9] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    Multimedia Systems, 2023, 29 (05): 2527-2543