Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47

Authors:

Xue, Hongyang [1]

Zhao, Zhou [2]

Cai, Deng [1]

Affiliations:

[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China

[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2017, Vol. 26, No. 12

Keywords:

Video question answering; attention model; scene understanding

DOI:

10.1109/TIP.2017.2746267

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering work mainly focuses on a single static image, which is distinct from the dynamic and sequential visual data in the real world. Such approaches cannot utilize the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. Open-ended answers enable wider applications than the common multiple-choice tasks in Visual-QA. We first propose a data set for open-ended Video-QA built with automatic question generation approaches. Then, we propose our sequential video attention and temporal question attention models. These two models apply the attention mechanism to videos and questions while preserving the sequential and temporal structures of the guides. The two models are integrated into a unified attention model. After the video and the question are encoded, a decoder generates the answers word by word. Finally, we evaluate our models on the proposed data set. The experimental results demonstrate the effectiveness of our proposed model.


Pages: 5656-5666

Page count: 11
