Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47
Authors
Xue, Hongyang [1 ]
Zhao, Zhou [2 ]
Cai, Deng [1 ]
Affiliations
[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Video question answering; attention model; scene understanding
DOI
10.1109/TIP.2017.2746267
CLC Classification Code
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering work mainly focuses on a single static image, which is distinct from the dynamic and sequential visual data of the real world, so these approaches cannot exploit the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. Open-ended answers enable wider applications than the common multiple-choice tasks in Visual-QA. We first construct a data set for open-ended Video-QA using automatic question generation approaches. We then propose our sequential video attention and temporal question attention models, which apply the attention mechanism to videos and questions while preserving the sequential and temporal structure of the guides. The two models are integrated into a unified attention model. After the video and the question are encoded, a decoder generates the answers word by word. Finally, we evaluate our models on the proposed data set. The experimental results demonstrate the effectiveness of our proposed model.
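To illustrate the core idea of question-guided temporal attention over video frames described in the abstract, here is a minimal NumPy sketch. This is not the authors' exact model: the bilinear scoring matrix `W`, the feature dimensions, and the random inputs are illustrative assumptions; the paper's full system additionally preserves sequential structure in the guide and feeds the attended features to a word-by-word decoder.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(frame_feats, query, W):
    """Question-guided attention over video frames (illustrative sketch).

    frame_feats: (T, d) per-frame video features
    query:       (d,)   question encoding that guides the attention
    W:           (d, d) bilinear scoring matrix (assumed, learned in practice)
    """
    scores = frame_feats @ W @ query   # (T,) relevance of each frame to the question
    weights = softmax(scores)          # (T,) temporal attention distribution
    context = weights @ frame_feats    # (d,) attended video summary
    return weights, context

# Toy usage with random features standing in for CNN frame encodings.
rng = np.random.default_rng(0)
T, d = 8, 16
frames = rng.standard_normal((T, d))
q = rng.standard_normal(d)
W = 0.1 * rng.standard_normal((d, d))
w, ctx = temporal_attention(frames, q, W)
```

The attention weights form a distribution over the `T` frames, so the context vector is a question-dependent convex combination of the frame features; a symmetric construction (video-guided attention over question words) gives the temporal question attention.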
Pages: 5656-5666
Page count: 11
Related Papers
(50 entries in total)
  • [31] Remember and forget: video and text fusion for video question answering
    Gao, Feng
    Ge, Yuanyuan
    Liu, Yongge
    Multimedia Tools and Applications, 2018, 77 : 29269 - 29282
  • [32] Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
    Chao, Guan-Lin
    Rastogi, Abhinav
    Yavuz, Semih
    Hakkani-Tur, Dilek
    Chen, Jindong
    Lane, Ian
    20TH ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2019), 2019, : 215 - 225
  • [33] Equivariant and Invariant Grounding for Video Question Answering
    Li, Yicong
    Wang, Xiang
    Xiao, Junbin
    Chua, Tat-Seng
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4714 - 4722
  • [34] TVQA: Localized, Compositional Video Question Answering
    Lei, Jie
    Yu, Licheng
    Bansal, Mohit
    Berg, Tamara L.
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 1369 - 1379
  • [35] HIERARCHICAL RELATIONAL ATTENTION FOR VIDEO QUESTION ANSWERING
    Chowdhury, Muhammad Iqbal Hasan
    Nguyen, Kien
    Sridharan, Sridha
    Fookes, Clinton
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 599 - 603
  • [36] Research Progress of Video Question Answering Technologies
    Bao C.
    Ding K.
    Dong J.
    Yang X.
    Xie M.
    Wang X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (03): : 639 - 673
  • [37] VQuAD: Video Question Answering Diagnostic Dataset
    Gupta, Vivek
    Patro, Badri N.
    Parihar, Hemant
    Namboodiri, Vinay P.
    2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022, : 282 - 291
  • [38] Multichannel Attention Refinement for Video Question Answering
    Zhuang, Yueting
    Xu, Dejing
    Yan, Xin
    Cheng, Wenzhuo
    Zhao, Zhou
    Pu, Shiliang
    Xiao, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [39] CSA-BERT: Video Question Answering
    Jenni, Kommineni
    Srinivas, M.
    Sannapu, Roshni
    Perumal, Murukessan
    2023 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP, SSP, 2023, : 532 - 536
  • [40] Uncovering the Temporal Context for Video Question Answering
    Zhu, Linchao
    Xu, Zhongwen
    Yang, Yi
    Hauptmann, Alexander G.
    International Journal of Computer Vision, 2017, 124 : 409 - 421