Video Question Answering: A survey of the state-of-the-art

Cited by: 0
Authors
Jeshmol, P. J. [1]
Kovoor, Binsu C. [1]
Affiliations
[1] Cochin Univ Sci & Technol, Div Informat Technol, Kochi, Kerala, India
Keywords
Video Question Answering; Computer vision; Natural language processing; BENCHMARK; NETWORK;
DOI
10.1016/j.jvcir.2024.104320
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Video Question Answering (VideoQA) has emerged as a prominent research area at the intersection of Artificial Intelligence, Computer Vision, and Natural Language Processing. It involves developing systems capable of understanding, analyzing, and responding to questions about the content of videos. This survey presents an in-depth overview of the current landscape of VideoQA, covering the challenges, methodologies, datasets, and innovative approaches in the domain. The key components of a VideoQA framework include video feature extraction, question processing, reasoning, and response generation. The survey underscores the importance of datasets in shaping VideoQA research and the diversity of question types, from factual inquiries to spatial and temporal reasoning. It also highlights ongoing research directions and future prospects for VideoQA. Finally, it offers a road map for future work at the intersection of these disciplines, with the stated aim of advancing knowledge and innovation.
Pages: 14
Related papers
130 in total
[1]   MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering [J].
Ahmad, Mobeen ;
Park, Geonwoo ;
Park, Dongchan ;
Park, Sanguk .
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, :4659-4664
[2]  
Alicioglu G., 2021, Comput. Graph., V102
[3]  
Anwer RM, 2023, arXiv, DOI 10.48550/arXiv.2307.13721
[4]   ViViT: A Video Vision Transformer [J].
Arnab, Anurag ;
Dehghani, Mostafa ;
Heigold, Georg ;
Sun, Chen ;
Lucic, Mario ;
Schmid, Cordelia .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :6816-6826
[5]  
Bai ZY, 2023, ADV NEUR IN
[6]  
Baumli K, 2024, arXiv:2312.09187
[7]  
Chen D., 2021, Keyword-aware multi-modal enhancement attention for video question answering, P128, DOI 10.1145/3507548.3507567
[8]   Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [J].
Cheng, Yi ;
Fan, Hehe ;
Lin, Dongyun ;
Sun, Ying ;
Kankanhalli, Mohan ;
Lim, Joo-Hwee .
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 :6131-6141
[9]  
Choi M, 2024, arXiv:2403.11021
[10]  
Colas, Anthony, 2019, TutorialVQA: Question answering dataset for tutorial videos