Multi-Granularity Interaction and Integration Network for Video Question Answering

Cited by: 7
Authors
Wang, Yuanyuan [1 ]
Liu, Meng [2 ]
Wu, Jianlong [3 ]
Nie, Liqiang [3 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
[2] Shandong Jianzhu Univ, Sch Comp Sci & Technol, Jinan 250101, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Question answering (information retrieval); Object oriented modeling; Video question answering; multi-granularity interaction modeling; long-tailed answers;
DOI
10.1109/TCSVT.2023.3278492
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
Video question answering, which aims to answer a natural language question about a given video, has gained popularity in the last few years. Although significant improvements have been achieved, it still faces two challenges: sufficient comprehension of video content and long-tailed answers. To this end, we propose a multi-granularity interaction and integration network for video question answering. It jointly explores multi-level intra-granularity and inter-granularity relations to enhance the comprehension of videos. Specifically, we first build a word-enhanced visual representation module to achieve cross-modal alignment. We then advance a multi-granularity interaction module to explore the intra-granularity and inter-granularity relationships. Finally, a question-guided interaction module is developed to select question-related visual representations for answer prediction. In addition, we employ the seesaw loss for open-ended tasks to alleviate the effect of the long-tailed word distribution. Both the quantitative and qualitative results on the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
Pages: 7684-7695
Page count: 12
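The abstract states that the seesaw loss is used on open-ended tasks to counter the long-tailed answer distribution. The record does not give the paper's exact formulation, so the following is only a minimal single-sample sketch of the standard seesaw loss (a softmax cross-entropy whose negative-class terms are rescaled by mitigation and compensation factors); the parameter values `p` and `q` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def seesaw_loss(logits, target, class_counts, p=0.8, q=2.0):
    """Sketch of the seesaw loss for one sample (hypothetical parameters).

    logits: (C,) raw answer scores; target: ground-truth answer index;
    class_counts: (C,) number of training samples seen per answer class.
    """
    # Ordinary softmax probabilities, used by the compensation factor.
    z = logits - logits.max()          # stabilize exponentials
    probs = np.exp(z) / np.exp(z).sum()

    # Mitigation factor: shrink the penalty pushed onto negative classes
    # that are rarer than the target class.
    N = class_counts.astype(np.float64)
    M = np.minimum(1.0, (N / N[target]) ** p)

    # Compensation factor: re-amplify negatives the model currently
    # ranks above the target, so mitigation does not hide real errors.
    Cf = np.maximum(1.0, (probs / probs[target]) ** q)

    S = M * Cf
    S[target] = 1.0                    # the target term is never rescaled

    # Seesaw softmax: negatives' exponentials are scaled by S.
    sigma_t = np.exp(z[target]) / (S * np.exp(z)).sum()
    return -np.log(sigma_t)
```

With `p=0` and a correctly ranked target, both factors reduce to 1 and the loss collapses to ordinary cross-entropy; for a frequent target class, mitigation shrinks the denominator and yields a smaller loss than plain cross-entropy, which is the long-tail effect the abstract alludes to.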