Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47
Authors
Xue, Hongyang [1]
Zhao, Zhou [2]
Cai, Deng [1]
Affiliations
[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Video question answering; attention model; scene understanding;
DOI
10.1109/TIP.2017.2746267
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering work mainly focuses on single static images, which differ from the dynamic, sequential visual data of the real world, so these approaches cannot exploit the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. Open-ended answers enable wider applications than the multiple-choice tasks common in Visual-QA. We first propose a data set for open-ended Video-QA built with automatic question generation approaches. Then, we propose our sequential video attention and temporal question attention models. These two models apply the attention mechanism to videos and questions while preserving the sequential and temporal structures of the guides. The two models are integrated into a unified attention model. After the video and the question are encoded, a decoder generates the answer word by word. Finally, we evaluate our models on the proposed data set. The experimental results demonstrate the effectiveness of our proposed model.
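The abstract does not give the form of the attention models, so the following is only a generic illustration of soft attention, not the authors' architecture. It assumes dot-product scoring between a single question vector and per-frame feature vectors; the function names, the scoring rule, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames, query):
    """Generic soft attention over frame features (assumed dot-product scoring).

    frames: list of equal-length feature vectors, one per video frame.
    query:  a question representation of the same length.
    Returns the attention weights and the weighted sum of the frames.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]
    return weights, context


# Toy inputs: three frames with 2-dimensional features and one question vector.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]
toy_weights, toy_context = attend(toy_frames, toy_query)
```

In this sketch the frames most similar to the question vector receive the largest weights, and the context vector is what an answer decoder would consume. It collapses the frames into one summary and so does not model the sequential and temporal structure that the abstract says the proposed models preserve.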
Pages: 5656-5666
Page count: 11
Related papers
50 in total
  • [1] Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering
    Jin, Yao
    Niu, Guocheng
    Xiao, Xinyan
    Zhang, Jian
    Peng, Xi
    Yu, Jun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023, : 8141 - 8149
  • [2] Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
    Luo, Haozheng
    Qin, Ruiyang
    Xu, Chenwei
    Ye, Guo
    Luo, Zening
    2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, 2023, : 363 - 369
  • [3] Coarse to Fine Frame Selection for Online Open-ended Video Question Answering
    Nuthalapati, Sai Vidyaranya
    Tunga, Anirudh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 353 - 361
  • [4] Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks
    Zhao, Zhou
    Xiao, Shuwen
    Song, Zehan
    Lu, Chujie
    Xiao, Jun
    Zhuang, Yueting
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 3859 - 3870
  • [5] AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
    Chen, Xiuyuan
    Lin, Yuan
    Zhang, Yuchen
    Huang, Weiran
    COMPUTER VISION - ECCV 2024, PT XXXVII, 2025, 15095 : 179 - 195
  • [6] Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
    Zhao, Zhou
    Zhang, Zhu
    Xiao, Shuwen
    Yu, Zhou
    Yu, Jun
    Cai, Deng
    Wu, Fei
    Zhuang, Yueting
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 3683 - 3689
  • [7] The Open-Ended Question
    Chapman-Novakofski, Karen
    JOURNAL OF NUTRITION EDUCATION AND BEHAVIOR, 2011, 43 (03) : 141 - 141
  • [8] Open-ended remote sensing visual question answering with transformers
    Al Rahhal, Mohamad M.
    Bazi, Yakoub
    Alsaleh, Sara O.
    Al-Razgan, Muna
    Mekhalfi, Mohamed Lamine
    Al Zuair, Mansour
    Alajlan, Naif
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 2022, 43 (18) : 6809 - 6823
  • [9] Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
    Ko, Dohwan
    Lee, Ji Soo
    Choi, Miso
    Chu, Jaewon
    Park, Jihwan
    Kim, Hyunwoo J.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3078 - 3089
  • [10] Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
    Zhang, Zhu
    Zhao, Zhou
    Lin, Zhijie
    Song, Jingkuan
    He, Xiaofei
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 4383 - 4389