Multi-Granularity Interaction and Integration Network for Video Question Answering

Cited by: 7

Authors:

Wang, Yuanyuan [1]

Liu, Meng [2]

Wu, Jianlong [3]

Nie, Liqiang [3]

Affiliations:

[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China

[2] Shandong Jianzhu Univ, Sch Comp Sci & Technol, Jinan 250101, Peoples R China

[3] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 518055, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2023 / Vol. 33 / No. 12

Funding:

National Natural Science Foundation of China;

Keywords:

Question answering (information retrieval); Object oriented modeling; Video question answering; multi-granularity interaction modeling; long-tailed answers;

DOI:

10.1109/TCSVT.2023.3278492

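Chinese Library Classification number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code: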

0808 ; 0809 ;

Abstract:

Video question answering, aiming to answer a natural language question related to the given video, has gained popularity in the last few years. Although significant improvements have been achieved, it is still confronted with two challenges: the sufficient comprehension of video content and the long-tailed answers. To this end, we propose a multi-granularity interaction and integration network for video question answering. It jointly explores multi-level intra-granularity and inter-granularity relations to enhance the comprehension of videos. To be specific, we first build a word-enhanced visual representation module to achieve cross-modal alignment. And then we advance a multi-granularity interaction module to explore the intra-granularity and inter-granularity relationships. Finally, a question-guided interaction module is developed to select question-related visual representations for answer prediction. In addition, we employ the seesaw loss for open-ended tasks to alleviate the long-tailed word distribution effect. Both the quantitative and qualitative results on TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.

Citation

Pages: 7684-7695

Page count: 12
