A Universal Quaternion Hypergraph Network for Multimodal Video Question Answering

Cited by: 18
Authors
Guo, Zhicheng [1]
Zhao, Jiaxuan [1]
Jiao, Licheng [1]
Liu, Xu [1]
Liu, Fang [1]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Quaternions; Task analysis; Cognition; Visualization; Knowledge discovery; Feature extraction; Convolution; Video question answering; multimodal features; quaternion operations; hypergraph convolution;
DOI
10.1109/TMM.2021.3120544
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Fusion and interaction of multimodal features are essential for video question answering. Structural information composed of the relationships between different objects in videos is very complex, which restricts understanding and reasoning. In this paper, we propose a quaternion hypergraph network (QHGN) for multimodal video question answering, to simultaneously involve multimodal features and structural information. Since quaternion operations are suitable for multimodal interactions, four components of the quaternion vectors are applied to represent the multimodal features. Furthermore, we construct a hypergraph based on the visual objects detected in the video. Most importantly, the quaternion hypergraph convolution operator is theoretically derived to realize multimodal and relational reasoning. Question and candidate answers are embedded in quaternion space, and a Q & A reasoning module is creatively designed for selecting the answer accurately. Moreover, the unified framework can be extended to other video-text tasks with different quaternion decoders. Experimental evaluations on the TVQA dataset and DramaQA dataset show that our method achieves state-of-the-art performance.
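As a rough illustration of the two building blocks the abstract names, the sketch below shows the standard Hamilton product of two quaternions and a simplified mean-aggregation step over a hypergraph incidence matrix. It is not the paper's QHGN operator, which the abstract does not specify. The function names `hamilton_product` and `hypergraph_mean_step`, the toy incidence matrix, and the choice of plain averaging (rather than a learned, normalised convolution) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Quaternion = Tuple[float, float, float, float]  # (real, i, j, k)


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Standard Hamilton product p * q; it mixes all four components."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


def hypergraph_mean_step(
    incidence: Sequence[Sequence[int]],
    features: Sequence[Sequence[float]],
) -> List[List[float]]:
    """One node -> hyperedge -> node averaging pass (illustrative only).

    incidence[v][e] is 1 when node v belongs to hyperedge e.
    Assumes every hyperedge is non-empty and every node is in some hyperedge.
    """
    n_nodes, n_edges, dim = len(incidence), len(incidence[0]), len(features[0])
    # Step 1: each hyperedge takes the mean of its member nodes' features.
    edge_feats = []
    for e in range(n_edges):
        members = [v for v in range(n_nodes) if incidence[v][e]]
        edge_feats.append(
            [sum(features[v][k] for v in members) / len(members) for k in range(dim)]
        )
    # Step 2: each node takes the mean of the hyperedges it belongs to.
    out = []
    for v in range(n_nodes):
        edges = [e for e in range(n_edges) if incidence[v][e]]
        out.append(
            [sum(edge_feats[e][k] for e in edges) / len(edges) for k in range(dim)]
        )
    return out


if __name__ == "__main__":
    i, j = (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)
    print(hamilton_product(i, j))  # i * j = k
    # Toy hypergraph: 3 nodes, 2 hyperedges {0, 1} and {1, 2}.
    H = [[1, 0], [1, 1], [0, 1]]
    X = [[0.0], [2.0], [4.0]]
    print(hypergraph_mean_step(H, X))
```

Because the Hamilton product is non-commutative and every output component depends on all four inputs, assigning one modality to each component gives a compact way to let the modalities interact, which is the motivation stated in the abstract.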
Pages: 38-49
Number of pages: 12