Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Cited by: 0
Authors
Tang, Jiahao [1]
Hu, Jianguo [1,2]
Huang, Wenjun [1]
Shen, Shengzhi [1]
Pan, Jiakai [1]
Wang, Deming [3]
Ding, Yanyu [4]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision natural language processing
DOI
10.1109/ACCESS.2024.3445636
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Current video question answering (VideoQA) algorithms that rely on video-text pretraining models employ intricate unimodal encoders and multimodal fusion Transformers, which often reduce efficiency in tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform suboptimally in video description and reasoning because of their simplistic graph construction and cross-modal interaction designs, and they need additional pretraining data to close the gap. In this work, we introduce the Spatio-temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-temporal Graph Convolution (STGC) and dynamic graph Transformers, our model explicitly captures the spatio-temporal relationships among visual objects, thereby facilitating dynamic interactions and enhancing visual reasoning capabilities. Moreover, our model introduces a novel cross-modal interaction approach that uses dynamic graph attention mechanisms to adjust the attention weights of visual objects based on the posed question, thereby augmenting multimodal cooperative perception. Through carefully designed graph structures and cross-modal interaction mechanisms, our model removes the dependence of graph-based algorithms on pretraining for performance gains, and it achieves superior performance in visual description and reasoning tasks with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons of the model's performance across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm its robust capabilities in video reasoning and description.
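The abstract describes adjusting the attention weights of visual objects based on the posed question. The sketch below illustrates only that general idea with plain scaled dot-product attention. It is not the STCT model or the authors' dynamic graph attention, and the function names, feature vectors, and dimensions are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(object_feats, question_vec):
    """Weight object features by their similarity to a question embedding.

    object_feats: list of N feature vectors (hypothetical visual object nodes).
    question_vec: one vector of the same dimension (hypothetical question embedding).
    Returns (weights, pooled); pooled is the attention-weighted sum of the objects.
    """
    dim = len(question_vec)
    # Scaled dot product between each object and the question.
    scores = [
        sum(o * q for o, q in zip(obj, question_vec)) / math.sqrt(dim)
        for obj in object_feats
    ]
    weights = softmax(scores)
    # Weighted sum of object features, one value per feature dimension.
    pooled = [
        sum(w * obj[d] for w, obj in zip(weights, object_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: three objects in a 2-D feature space (made-up values).
objects = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
weights, pooled = question_guided_attention(objects, question)
```

In this toy example the first object is most aligned with the question vector, so it receives the largest weight. A graph-based model would additionally restrict or modulate these weights using the edges between objects.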
Pages: 131664-131680
Page count: 17