Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Cited: 0
Authors
Tang, Jiahao [1 ]
Hu, Jianguo [1 ,2 ]
Huang, Wenjun [1 ]
Shen, Shengzhi [1 ]
Pan, Jiakai [1 ]
Wang, Deming [3 ]
Ding, Yanyu [4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision; natural language processing;
DOI
10.1109/ACCESS.2024.3445636
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Current video question answering (VideoQA) algorithms that rely on video-text pretraining models employ intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency on tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform suboptimally on video description and reasoning, owing to their simplistic graph construction and cross-modal interaction designs, and they require additional pretraining data to close these performance gaps. In this work, we introduce the Spatio-Temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-Temporal Graph Convolution (STGC) and dynamic graph Transformers, our model explicitly captures the spatio-temporal relationships among visual objects, facilitating dynamic interactions and enhancing visual reasoning. Moreover, the model introduces a novel cross-modal interaction approach that uses dynamic graph attention to adjust the attention weights of visual objects according to the posed question, thereby strengthening multimodal cooperative perception. By addressing, through carefully designed graph structures and cross-modal interaction mechanisms, the limitation of graph-based algorithms that depend on pretraining for performance gains, our model achieves superior performance on visual description and reasoning tasks with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm its robust capabilities in video reasoning and description.
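For illustration only, the sketch below shows the kind of question-conditioned dynamic graph attention over spatio-temporal object nodes that the abstract describes: a single graph-convolution hop propagates object features along spatio-temporal edges, and a question embedding then re-weights the nodes. This is a minimal sketch under stated assumptions, not the authors' STCT implementation; the class name, tensor shapes, and adjacency construction are hypothetical.

```python
# Minimal sketch (NOT the authors' STCT implementation): question-conditioned
# dynamic graph attention over spatio-temporal object nodes. Module names,
# tensor shapes, and the single-hop graph convolution are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGraphAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the question embedding
        self.k_proj = nn.Linear(dim, dim)  # projects object (node) features
        self.gcn = nn.Linear(dim, dim)     # weight of a one-hop graph convolution

    def forward(self, obj_feats, adj, question):
        # obj_feats: (B, N, D) object-region features across sampled frames
        # adj:       (B, N, N) spatio-temporal adjacency (e.g. spatial overlap
        #                      within a frame plus temporal links across frames)
        # question:  (B, D)    sentence-level question embedding
        # 1) Graph convolution: propagate node features along spatio-temporal edges.
        adj_norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        nodes = F.relu(self.gcn(torch.bmm(adj_norm, obj_feats)))
        # 2) Dynamic graph attention: re-weight nodes by relevance to the question.
        scores = torch.bmm(self.k_proj(nodes),
                           self.q_proj(question).unsqueeze(-1)).squeeze(-1)  # (B, N)
        weights = torch.softmax(scores / nodes.size(-1) ** 0.5, dim=-1)
        # 3) Question-aware video representation to be fed to an answer decoder.
        return torch.bmm(weights.unsqueeze(1), nodes).squeeze(1)  # (B, D)


# Usage with random tensors (B=2 clips, N=16 object nodes, D=256 channels):
# model = DynamicGraphAttention(dim=256)
# video_repr = model(torch.randn(2, 16, 256), torch.rand(2, 16, 16), torch.randn(2, 256))
```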
Pages: 131664-131680
Number of pages: 17
Related Papers
50 records in total
  • [21] Zhu, Honglei; Wei, Pengjuan; Xu, Zhigang. A Spatio-Temporal Enhanced Graph-Transformer AutoEncoder embedded pose for anomaly detection. IET COMPUTER VISION, 2024, 18 (03): 405-419
  • [22] Peng, Liang; Yang, Shuangji; Bin, Yi; Wang, Guoqing. Progressive Graph Attention Network for Video Question Answering. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 2871-2879
  • [23] Gao, Yuzhe; Li, Xing; Zhang, Jiajian; Zhou, Yu; Jin, Dian; Wang, Jing; Zhu, Shenggao; Bai, Xiang. Video Text Tracking With a Spatio-Temporal Complementary Model. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30: 9321-9331
  • [24] Li, Zhenqiang; Wang, Weimin; Li, Zuoyue; Huang, Yifei; Sato, Yoichi. Spatio-Temporal Perturbations for Video Attribution. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (04): 2043-2056
  • [25] Wang, Zhongning; Zhang, Jianwei; Chen, Jicheng; Zhang, Hui. Spatio-Temporal Context Graph Transformer Design for Map-Free Multi-Agent Trajectory Prediction. IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2024, 9 (01): 1369-1381
  • [26] Guo, Dalu; Xu, Chang; Tao, Dacheng. Bilinear Graph Networks for Visual Question Answering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (02): 1023-1034
  • [27] Lin, Liqun; Wang, Jianhui; Wei, Guangpeng; Wang, Mingxing; Zhang, Ang. FFSTIE: Video Restoration With Full-Frequency Spatio-Temporal Information Enhancement. IEEE SIGNAL PROCESSING LETTERS, 2025, 32: 571-575
  • [28] Sheng, Xiaoxiao; Shen, Zhiqiang; Xiao, Gang. PointSDA: Spatio-Temporal Deformable Attention Network for Point Cloud Video Modeling. IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (12): 10946-10953
  • [29] Jin, Weike; Zhao, Zhou; Cao, Xiaochun; Zhu, Jieming; He, Xiuqiang; Zhuang, Yueting. Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30: 5477-5489
  • [30] Xu, Dongwei; Shang, Xuetian; Liu, Yewanze; Peng, Hang; Li, Haijian. Group Vehicle Trajectory Prediction With Global Spatio-Temporal Graph. IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2023, 8 (02): 1219-1229