Graph-Based Multi-Interaction Network for Video Question Answering

Cited by: 31
Authors
Gu, Mao [1 ]
Zhao, Zhou [1 ,2 ]
Jin, Weike [1 ]
Hong, Richang [3 ]
Wu, Fei [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Alibaba Zhejiang Univ Joint Res Inst Frontier Tec, Hangzhou 310027, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat, Hefei 230027, Anhui, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Zhejiang Province
Keywords
Visualization; Knowledge discovery; Cats; Semantics; Task analysis; Image segmentation; Adaptation models; Video question answering; multi-interaction; graph-based relation-aware neural network; ARTIFICIAL-INTELLIGENCE; KNOWLEDGE;
DOI
10.1109/TIP.2021.3051756
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video question answering is an important task combining Natural Language Processing and Computer Vision, requiring a machine to obtain a thorough understanding of the video. Most existing approaches capture spatio-temporal information in videos simply by combining recurrent and convolutional neural networks. However, most previous work focuses only on salient frames or regions and therefore misses significant details, such as potential location and action relations. In this paper, we propose a new method called Graph-based Multi-interaction Network for video question answering. In our model, a new attention mechanism named multi-interaction is designed to capture element-wise and segment-wise sequence interactions simultaneously, both between and inside the multi-modal inputs. Moreover, we propose a graph-based relation-aware neural network that builds a more fine-grained visual representation by exploring the relationships and dependencies between objects spatially and temporally. We evaluate our method on TGIF-QA and two other video QA datasets. Qualitative and quantitative experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art performance.
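The abstract describes the graph-based relation-aware module only at a high level. As a rough illustration of how such a module could operate over detected-object features, the sketch below treats each object as a graph node, learns a soft adjacency matrix from pairwise similarity, and runs one round of message passing so that spatially or temporally related objects exchange information. This is a minimal PyTorch sketch under assumed shapes and names (RelationGraphLayer, the 512-d features, and the 36-object layout are all hypothetical), not the authors' implementation.

    # Minimal sketch of a relation-aware graph layer over object features.
    # NOT the paper's code; all names and shapes here are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelationGraphLayer(nn.Module):
        """One round of message passing over detected-object nodes.

        Edge weights form a soft adjacency matrix learned from pairwise
        similarity, so related objects can exchange information.
        """
        def __init__(self, dim: int):
            super().__init__()
            self.query = nn.Linear(dim, dim)   # projects nodes for scoring
            self.key = nn.Linear(dim, dim)
            self.value = nn.Linear(dim, dim)   # projects the passed messages

        def forward(self, nodes: torch.Tensor) -> torch.Tensor:
            # nodes: (batch, num_objects, dim) object features from a detector
            q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
            # Scaled pairwise scores -> row-normalized soft adjacency matrix.
            adj = F.softmax(q @ k.transpose(1, 2) / nodes.size(-1) ** 0.5,
                            dim=-1)
            # Aggregate neighbor messages; the residual keeps each node's
            # original features alongside the relational context.
            return nodes + adj @ v

    # Usage: e.g., 6 frames x 6 regions = 36 object nodes, 512-d features.
    layer = RelationGraphLayer(dim=512)
    objects = torch.randn(2, 36, 512)   # (batch, num_objects, dim)
    relation_aware = layer(objects)     # same shape, relation-enhanced

Stacking such a layer per frame captures spatial relations, while applying it across frames captures temporal ones; the paper's actual formulation should be taken from the full text.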
Pages: 2758-2770
Page count: 13