Video Question Answering with Procedural Programs

Cited by: 0

Authors:

Choudhury, Rohan [1]

Niinuma, Koichiro [2]

Kitani, Kris M. [1]

Jeni, Laszlo A. [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Fujitsu Res Amer, Santa Clara, CA USA

Source:

COMPUTER VISION-ECCV 2024, PT XXXVIII | 2025 / Vol. 15096

DOI:

10.1007/978-3-031-72920-1_18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
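The generate-then-execute loop described in the abstract can be illustrated with a minimal sketch. This is not ProViQ's implementation: the module names (`retrieve`, `count`), the prompt layout, and the `fake_llm` stand-in are all hypothetical, and the "video" is a list of frame descriptions rather than pixels. It only shows the general pattern of exposing a module API in a prompt, receiving a short program, and running it against the video.

```python
from typing import Callable, Dict, List

# Toy "video": a list of per-frame text descriptions standing in for pixels.
Video = List[str]

def retrieve(video: Video, query: str) -> Video:
    """Hypothetical module: keep frames whose description mentions the query."""
    return [frame for frame in video if query in frame]

def count(frames: Video) -> int:
    """Hypothetical module: count the frames passed in."""
    return len(frames)

# The module API that would be shown to the language model in the prompt.
MODULES: Dict[str, Callable] = {"retrieve": retrieve, "count": count}

def build_prompt(question: str) -> str:
    """Assemble a prompt from the module docstrings and the question."""
    api = "\n".join(f"{name}: {fn.__doc__}" for name, fn in MODULES.items())
    return f"API:\n{api}\n\nQuestion: {question}\nWrite answer(video)."

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed program for this demo."""
    return (
        "def answer(video):\n"
        "    frames = retrieve(video, 'dog')\n"
        "    return count(frames)\n"
    )

def run_program(source: str, video: Video):
    """Execute generated code with the modules in scope, then call answer()."""
    namespace = dict(MODULES)
    exec(source, namespace)  # no sandboxing here; unsafe for untrusted code
    return namespace["answer"](video)

video = ["a dog runs", "a cat sleeps", "a dog barks"]
program = fake_llm(build_prompt("How many frames show a dog?"))
result = run_program(program, video)
print(result)
```

In a real system the stub would be replaced by a language model call and the modules by actual vision models; the sketch keeps both trivial so that the control flow is the only thing on display.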


Pages: 315-332

Page count: 18
