Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Cited by: 0
Authors
Tang, Jiahao [1 ]
Hu, Jianguo [1 ,2 ]
Huang, Wenjun [1 ]
Shen, Shengzhi [1 ]
Pan, Jiakai [1 ]
Wang, Deming [3 ]
Ding, Yanyu [4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision; natural language processing
DOI
10.1109/ACCESS.2024.3445636
Chinese Library Classification
TP [Automation and Computer Technology];
Subject Classification
0812;
Abstract
Current video question answering (VideoQA) algorithms built on video-text pretraining models rely on intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency in tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform suboptimally in video description and reasoning because of their simplistic graph construction and cross-modal interaction designs, and they require additional pretraining data to close this performance gap. In this work, we introduce the Spatio-Temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-Temporal Graph Convolution (STGC) and dynamic graph Transformers, the model explicitly captures the spatio-temporal relationships among visual objects, enabling dynamic interactions and enhancing visual reasoning. The model also introduces a novel cross-modal interaction approach that uses a dynamic graph attention mechanism to adjust the attention weights of visual objects according to the posed question, strengthening multimodal cooperative perception. Through carefully designed graph structures and cross-modal interaction mechanisms, the model overcomes the dependence of graph-based algorithms on pretraining for performance gains, achieving superior results in visual description and reasoning with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm the model's robust capabilities in video reasoning and description.
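The abstract's core mechanism, question-conditioned ("dynamic") graph attention over visual-object nodes, can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's exact architecture: the layer names, shapes, and the specific way the question embedding biases the edge scores are illustrative choices.

```python
# Hypothetical sketch of question-guided graph attention: attention weights
# over visual-object nodes are modulated by a pooled question embedding,
# so objects relevant to the question receive more weight. Shapes and
# layer choices are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedGraphAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the question embedding
        self.k_proj = nn.Linear(dim, dim)  # projects object-node features
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, nodes, question, adj):
        # nodes:    (N, dim) visual-object node features for one clip
        # question: (dim,)   pooled question embedding
        # adj:      (N, N)   0/1 spatio-temporal adjacency between objects
        q = self.q_proj(question)                 # (dim,)
        k = self.k_proj(nodes)                    # (N, dim)
        v = self.v_proj(nodes)                    # (N, dim)
        scale = k.shape[-1] ** 0.5
        rel = (k @ q) / scale                     # (N,) question relevance
        scores = (k @ k.T) / scale                # (N, N) object-object scores
        scores = scores + rel.unsqueeze(0)        # bias edges toward relevant objects
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)          # normalize over graph neighbors
        return attn @ v                           # (N, dim) updated node features
```

In a full model, this layer would sit after the spatio-temporal graph convolution, with the updated node features fed into the multimodal fusion module.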
Pages: 131664-131680
Page count: 17