Hybrid Graph Reasoning With Dynamic Interaction for Visual Dialog

Cited by: 1
Authors
Du, Shanshan [1,2]
Wang, Hanli [1,2]
Li, Tengpeng [1,2]
Chen, Chang Wen [3]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
[2] Tongji Univ, Serv Comp, Minist Educ, Key Lab Embedded Syst, Shanghai 200092, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Cognition; Semantics; Task analysis; Routing; History; Transformers; Cross-modal interaction; dynamic routing; graph neural network; graph reasoning; visual dialog;
DOI
10.1109/TMM.2024.3385997
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
As a pivotal branch of intelligent human-computer interaction, visual dialog is a technically challenging task that requires artificial intelligence (AI) agents to answer consecutive questions based on image content and dialog history. Despite considerable progress, visual dialog still suffers from two major problems: (1) how to design flexible cross-modal interaction patterns instead of relying heavily on expert experience and (2) how to effectively infer the underlying semantic dependencies between dialogues. To address these issues, an end-to-end framework employing dynamic interaction and hybrid graph reasoning is proposed in this work. Specifically, three major components are designed and their practical benefits are demonstrated by extensive experiments. First, a dynamic interaction module is developed to automatically determine the optimal modality interaction route for diverse questions; it consists of three elaborate functional interaction blocks endowed with dynamic routers. Second, a hybrid graph reasoning module is designed to explore adequate semantic associations between dialogues from multiple perspectives, where the hybrid graph is constructed by aggregating a structured coreference graph and a context-aware temporal graph. Third, a unified one-stage visual dialog model with an end-to-end structure is developed to train the dynamic interaction module and the hybrid graph reasoning module in a collaborative manner. Extensive experiments on the benchmark datasets VisDial v0.9 and VisDial v1.0 demonstrate the effectiveness of the proposed method compared to other state-of-the-art approaches.
Pages: 9095-9108
Number of pages: 14
Related papers
50 in total
  • [11] Scene Graph Refinement Network for Visual Question Answering
    Qian, Tianwen
    Chen, Jingjing
    Chen, Shaoxiang
    Wu, Bo
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3950 - 3961
  • [12] Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning
    Zhu, Jian
    Wang, Hanli
    He, Bin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1295 - 1305
  • [13] DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering
    Wang, Yaxian
    Wei, Bifan
    Liu, Jun
    Zhang, Lingling
    Wang, Jiaxin
    Wang, Qianying
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 4812 - 4827
  • [14] Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering
    Bai, Ziyi
    Wang, Ruiping
    Gao, Difei
    Chen, Xilin
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1109 - 1121
  • [15] Recurrent Adaptive Graph Reasoning Network With Region and Boundary Interaction for Salient Object Detection in Optical Remote Sensing Images
    Zhao, Jie
    Jia, Yun
    Ma, Lin
    Yu, Lidan
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [16] TodBR: Target-Oriented Dialog with Bidirectional Reasoning on Knowledge Graph
    Qu, Zongfeng
    Yang, Zhitong
    Wang, Bo
    Hu, Qinghua
    APPLIED SCIENCES-BASEL, 2024, 14 (01):
  • [17] Improving Visual Reasoning Through Semantic Representation
    Zheng, Wenfeng
    Liu, Xiangjun
    Ni, Xubin
    Yin, Lirong
    Yang, Bo
    IEEE ACCESS, 2021, 9 : 91476 - 91486
  • [18] Aligning vision-language for graph inference in visual dialog
    Jiang, Tianling
    Shao, Hailin
    Tian, Xin
    Ji, Yi
    Liu, Chunping
    IMAGE AND VISION COMPUTING, 2021, 116
  • [19] Multi-Granularity Semantic Collaborative Reasoning Network for Visual Dialog
    Zhang, Hongwei
    Wang, Xiaojie
    Jiang, Si
    Li, Xuefeng
    APPLIED SCIENCES-BASEL, 2022, 12 (18):
  • [20] Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training
    Liu, An-An
    Huang, Chenxi
    Xu, Ning
    Tian, Hongshuo
    Liu, Jing
    Zhang, Yongdong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1639 - 1651