Hybrid Graph Reasoning With Dynamic Interaction for Visual Dialog

Cited by: 2

Authors:

Du, Shanshan [1,2]

Wang, Hanli [1,2]

Li, Tengpeng [1,2]

Chen, Chang Wen [3]

Affiliations:

[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China

[2] Tongji Univ, Serv Comp, Minist Educ, Key Lab Embedded Syst, Shanghai 200092, Peoples R China

[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Cognition; Semantics; Task analysis; Routing; History; Transformers; Cross-modal interaction; dynamic routing; graph neural network; graph reasoning; visual dialog;

DOI:

10.1109/TMM.2024.3385997

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

As a pivotal branch of intelligent human-computer interaction, visual dialog is a technically challenging task that requires artificial intelligence (AI) agents to answer consecutive questions based on image content and dialog history. Despite considerable progress, visual dialog still suffers from two major problems: (1) how to design flexible cross-modal interaction patterns instead of relying heavily on expert experience and (2) how to effectively infer the underlying semantic dependencies between dialogs. To address these issues, an end-to-end framework employing dynamic interaction and hybrid graph reasoning is proposed in this work. Specifically, three major components are designed, and their practical benefits are demonstrated by extensive experiments. First, a dynamic interaction module is developed to automatically determine the optimal modality interaction route for diverse questions; it consists of three functional interaction blocks equipped with dynamic routers. Second, a hybrid graph reasoning module is designed to explore semantic associations between dialogs from multiple perspectives, where the hybrid graph is constructed by aggregating a structured coreference graph and a context-aware temporal graph. Third, a unified one-stage visual dialog model with an end-to-end structure is developed to train the dynamic interaction module and the hybrid graph reasoning module collaboratively. Extensive experiments on the benchmark datasets VisDial v0.9 and VisDial v1.0 demonstrate the effectiveness of the proposed method compared with other state-of-the-art approaches.

Pages: 9095-9108

Page count: 14