Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47
Authors
Xue, Hongyang [1]
Zhao, Zhou [2]
Cai, Deng [1]
Affiliations
[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Video question answering; attention model; scene understanding
DOI
10.1109/TIP.2017.2746267
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering work focuses mainly on single static images, which differ from the dynamic and sequential visual data of the real world, so those approaches cannot exploit the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. Open-ended answers enable wider applications than the multiple-choice tasks common in Visual-QA. We first build a data set for open-ended Video-QA using automatic question generation. We then propose sequential video attention and temporal question attention models, which apply the attention mechanism to videos and questions while preserving the sequential and temporal structure of the guiding input. The two models are integrated into a unified attention model. After the video and the question are encoded, a decoder generates the answers word by word. Finally, we evaluate our models on the proposed data set, and the experimental results demonstrate the effectiveness of our approach.
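The abstract describes attention over video frames guided by the question (and over question words guided by the video). As a rough illustration only, the following is a minimal PyTorch sketch of one such question-guided temporal attention step over per-frame features; the class and parameter names (TemporalAttention, proj_video, hidden_dim, and the feature dimensions) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Attend over per-frame video features, guided by a question vector."""
    def __init__(self, video_dim: int, question_dim: int, hidden_dim: int):
        super().__init__()
        self.proj_video = nn.Linear(video_dim, hidden_dim)
        self.proj_question = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames:   (batch, num_frames, video_dim) per-frame CNN features
        # question: (batch, question_dim) encoded question vector
        keys = self.proj_video(frames)                     # (B, T, H)
        guide = self.proj_question(question).unsqueeze(1)  # (B, 1, H)
        # Additive (Bahdanau-style) score for each frame.
        scores = self.score(torch.tanh(keys + guide))      # (B, T, 1)
        weights = F.softmax(scores, dim=1)                 # normalize over time
        # Weighted sum of frames -> question-guided video summary.
        context = (weights * frames).sum(dim=1)            # (B, video_dim)
        return context, weights.squeeze(-1)

# Toy usage: 2 videos, 16 frames of 2048-d features, 512-d question vectors.
attn = TemporalAttention(video_dim=2048, question_dim=512, hidden_dim=256)
frames = torch.randn(2, 16, 2048)
question = torch.randn(2, 512)
context, weights = attn(frames, question)
print(context.shape, weights.shape)  # (2, 2048) and (2, 16)

This sketch covers only a single attention pass; per the abstract, the paper's sequential and unified variants additionally preserve the order of the guiding input and fuse the video-side and question-side attentions before word-by-word answer decoding.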
Pages: 5656-5666
Number of pages: 11