TVQA: Localized, Compositional Video Question Answering

Times cited: 0
Authors
Lei, Jie [1]
Yu, Licheng [1]
Bansal, Mohit [1]
Berg, Tamara L. [1]
Affiliation
[1] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
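As a quick sense of scale, the figures in the abstract imply roughly seven QA pairs per clip and, because the total duration is stated as "over 460 hours" (a lower bound), at least about 76 seconds of video per clip on average. The minimal Python sketch below only reproduces that arithmetic; the constant and function names are illustrative assumptions and do not come from the paper or its released code.

```python
# Figures quoted in the TVQA abstract.
NUM_QA_PAIRS = 152_545
NUM_CLIPS = 21_793
MIN_TOTAL_HOURS = 460  # abstract says "over 460 hours", so this is a lower bound


def dataset_averages(qa_pairs: int, clips: int, total_hours: float) -> tuple[float, float]:
    """Return (average QA pairs per clip, average seconds of video per clip)."""
    qa_per_clip = qa_pairs / clips
    seconds_per_clip = total_hours * 3600 / clips  # hours -> seconds, spread over clips
    return qa_per_clip, seconds_per_clip


if __name__ == "__main__":
    qa_per_clip, secs_per_clip = dataset_averages(NUM_QA_PAIRS, NUM_CLIPS, MIN_TOTAL_HOURS)
    print(f"~{qa_per_clip:.2f} QA pairs per clip")                   # ~7.00
    print(f"at least ~{secs_per_clip:.1f} s of video per clip")      # ~76.0
```

Because 460 hours is only a floor, the per-clip duration computed here is a lower bound rather than an exact average.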
Pages: 1369-1379
Page count: 11
Related papers
50 in total
  • [21] Maintaining Reasoning Consistency in Compositional Visual Question Answering
    Jing, Chenchen
    Jia, Yunde
    Wu, Yuwei
    Liu, Xinyu
    Wu, Qi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5089 - 5098
  • [22] Unifying the Video and Question Attentions for Open-Ended Video Question Answering
    Xue, Hongyang
    Zhao, Zhou
    Cai, Deng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (12) : 5656 - 5666
  • [23] Equivariant and Invariant Grounding for Video Question Answering
    Li, Yicong
    Wang, Xiang
    Xiao, Junbin
    Chua, Tat-Seng
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4714 - 4722
  • [24] Hierarchical Relational Attention for Video Question Answering
    Chowdhury, Muhammad Iqbal Hasan
    Nguyen, Kien
    Sridharan, Sridha
    Fookes, Clinton
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 599 - 603
  • [25] VQuAD: Video Question Answering Diagnostic Dataset
    Gupta, Vivek
    Patro, Badri N.
    Parihar, Hemant
    Namboodiri, Vinay P.
    2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022, : 282 - 291
  • [26] Multichannel Attention Refinement for Video Question Answering
    Zhuang, Yueting
    Xu, Dejing
    Yan, Xin
    Cheng, Wenzhuo
    Zhao, Zhou
    Pu, Shiliang
    Xiao, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [27] Research Progress of Video Question Answering Technologies
    Bao C.
    Ding K.
    Dong J.
    Yang X.
    Xie M.
    Wang X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (03): : 639 - 673
  • [28] CSA-BERT: Video Question Answering
    Jenni, Kommineni
    Srinivas, M.
    Sannapu, Roshni
    Perumal, Murukessan
    2023 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP, SSP, 2023, : 532 - 536
  • [29] Remember and forget: video and text fusion for video question answering
    Gao, Feng
    Ge, Yuanyuan
    Liu, Yongge
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (22) : 29269 - 29282
  • [30] Uncovering the Temporal Context for Video Question Answering
    Zhu, Linchao
    Xu, Zhongwen
    Yang, Yi
    Hauptmann, Alexander G.
    International Journal of Computer Vision, 2017, 124 : 409 - 421