Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Times Cited: 0
Authors
Tang, Jiahao [1 ]
Hu, Jianguo [1 ,2 ]
Huang, Wenjun [1 ]
Shen, Shengzhi [1 ]
Pan, Jiakai [1 ]
Wang, Deming [3 ]
Ding, Yanyu [4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision; natural language processing
DOI
10.1109/ACCESS.2024.3445636
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Number
0812
Abstract
Current video question answering (VideoQA) algorithms built on video-text pretraining models employ intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency in tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform poorly in video description and reasoning owing to their simplistic graph construction and cross-modal interaction designs, and they require additional pretraining data to close these performance gaps. In this work, we introduce the Spatio-temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-temporal Graph Convolution (STGC) and dynamic graph Transformers, our model explicitly captures the spatio-temporal relationships among visual objects, thereby facilitating dynamic interactions and enhancing visual reasoning capabilities. Moreover, our model introduces a novel cross-modal interaction approach that uses dynamic graph attention to adjust the attention weights of visual objects according to the posed question, thereby strengthening multimodal cooperative perception. Through carefully designed graph structures and cross-modal interaction mechanisms, our model overcomes the reliance of graph-based algorithms on pretraining for performance gains and achieves superior performance in visual description and reasoning tasks with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons on multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm its robust capabilities in video reasoning and description.
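The abstract describes question-conditioned graph attention over visual objects only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' STCT implementation; the class name QuestionGuidedGraphAttention, the pooled question vector, and the binary spatio-temporal adjacency matrix are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QuestionGuidedGraphAttention(nn.Module):
    """Toy question-conditioned graph attention over visual-object nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.node_q = nn.Linear(dim, dim)   # attention queries from object features
        self.node_k = nn.Linear(dim, dim)   # attention keys from object features
        self.node_v = nn.Linear(dim, dim)   # values aggregated along graph edges
        self.ques_b = nn.Linear(dim, dim)   # question-dependent bias added to the keys
        self.scale = dim ** -0.5

    def forward(self, nodes, question, adj):
        # nodes: (N, d) object features; question: (d,) pooled question embedding;
        # adj: (N, N) binary spatio-temporal adjacency (1 = edge kept)
        k = self.node_k(nodes) + self.ques_b(question)        # condition keys on the question
        scores = self.node_q(nodes) @ k.t() * self.scale      # pairwise object affinities
        scores = scores.masked_fill(adj == 0, float("-inf"))  # restrict attention to graph edges
        weights = scores.softmax(dim=-1)                      # question-aware edge weights
        return weights @ self.node_v(nodes)                   # aggregated node features

# Toy usage: 5 objects, feature dim 64, fully connected spatio-temporal graph.
layer = QuestionGuidedGraphAttention(64)
out = layer(torch.randn(5, 64), torch.randn(64), torch.ones(5, 5))
print(out.shape)  # torch.Size([5, 64])
```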
Pages: 131664-131680
Page count: 17
Related Papers
50 in total
  • [1] Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
    Cheng, Yi
    Fan, Hehe
    Lin, Dongyun
    Sun, Ying
    Kankanhalli, Mohan
    Lim, Joo-Hwee
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6131 - 6141
  • [2] Contrastive Video Question Answering via Video Graph Transformer
    Xiao, Junbin
    Zhou, Pan
    Yao, Angela
    Li, Yicong
    Hong, Richang
    Yan, Shuicheng
    Chua, Tat-Seng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13265 - 13280
  • [3] Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Zhang, Bo
    Li, Zhoujun
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1684 - 1696
  • [4] Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering
    Bai, Ziyi
    Wang, Ruiping
    Gao, Difei
    Chen, Xilin
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1109 - 1121
  • [5] Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling
    Fan, Hehe
    Yang, Yi
    Kankanhalli, Mohan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (02) : 2181 - 2192
  • [6] Long-Interval Spatio-Temporal Graph Convolution for Brain Disease Diagnosis
    Li, Shengrong
    Zhu, Qi
    Guan, Donghai
    Shen, Bo
    Zhang, Li
    Ji, Yixin
    Qi, Shile
    Zhang, Daoqiang
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2025, 74
  • [7] Exploring Spatio–Temporal Graph Convolution for Video-Based Human–Object Interaction Recognition
    Wang, Ning
    Zhu, Guangming
    Li, Hongsheng
    Feng, Mingtao
    Zhao, Xia
    Ni, Lan
    Shen, Peiyi
    Mei, Lin
    Zhang, Liang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5814 - 5827
  • [8] Cross-Frame Transformer-Based Spatio-Temporal Video Super-Resolution
    Zhang, Wenhui
    Zhou, Mingliang
    Ji, Cheng
    Sui, Xiubao
    Bai, Junqi
    IEEE TRANSACTIONS ON BROADCASTING, 2022, 68 (02) : 359 - 369
  • [9] TLNet: Temporal Span Localization Network With Collaborative Graph Reasoning for Video Question Answering
    Liang, Lili
    Sun, Guanglu
    Li, Tianlin
    Liu, Shuai
    Ding, Weiping
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024,
  • [10] Adaptive Graph Convolution Neural Differential Equation for Spatio-Temporal Time Series Prediction
    Han, Min
    Wang, Qipeng
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2025, 37 (06) : 3193 - 3204