Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Cited by: 0
Authors
Tang, Jiahao [1 ]
Hu, Jianguo [1 ,2 ]
Huang, Wenjun [1 ]
Shen, Shengzhi [1 ]
Pan, Jiakai [1 ]
Wang, Deming [3 ]
Ding, Yanyu [4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision; natural language processing
DOI
10.1109/ACCESS.2024.3445636
Chinese Library Classification
TP [Automation and Computer Technology];
Subject Classification
0812;
Abstract
Current video question answering (VideoQA) algorithms built on video-text pretraining models rely on intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency in tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform suboptimally in video description and reasoning because of their simplistic graph construction and cross-modal interaction designs, and they require additional pretraining data to close this performance gap. In this work, we introduce the Spatio-Temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-Temporal Graph Convolution (STGC) and dynamic graph Transformers, the model explicitly captures the spatio-temporal relationships among visual objects, enabling dynamic interactions and enhancing visual reasoning. The model also introduces a novel cross-modal interaction approach that uses a dynamic graph attention mechanism to adjust the attention weights of visual objects according to the posed question, strengthening multimodal cooperative perception. Through carefully designed graph structures and cross-modal interaction mechanisms, the model overcomes the dependence of graph-based algorithms on pretraining for performance gains, achieving superior results in visual description and reasoning with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm the model's robust capabilities in video reasoning and description.
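The abstract's core mechanism, question-conditioned ("dynamic") graph attention over visual-object nodes, can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's exact architecture: the layer names, shapes, and the specific way the question embedding biases the edge scores are illustrative choices.

```python
# Hypothetical sketch of question-guided graph attention: attention weights
# over visual-object nodes are modulated by a pooled question embedding,
# so objects relevant to the question receive more weight. Shapes and
# layer choices are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedGraphAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the question embedding
        self.k_proj = nn.Linear(dim, dim)  # projects object-node features
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, nodes, question, adj):
        # nodes:    (N, dim) visual-object node features for one clip
        # question: (dim,)   pooled question embedding
        # adj:      (N, N)   0/1 spatio-temporal adjacency between objects
        q = self.q_proj(question)                 # (dim,)
        k = self.k_proj(nodes)                    # (N, dim)
        v = self.v_proj(nodes)                    # (N, dim)
        scale = k.shape[-1] ** 0.5
        rel = (k @ q) / scale                     # (N,) question relevance
        scores = (k @ k.T) / scale                # (N, N) object-object scores
        scores = scores + rel.unsqueeze(0)        # bias edges toward relevant objects
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)          # normalize over graph neighbors
        return attn @ v                           # (N, dim) updated node features
```

In a full model, this layer would sit after the spatio-temporal graph convolution, with the updated node features fed into the multimodal fusion module.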
Pages: 131664-131680
Page count: 17