AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

Cited by: 0

Authors:

Chen, Xiuyuan [1]

Lin, Yuan [3]

Zhang, Yuchen [2]

Huang, Weiran [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Qing Yuan Res Inst, MIFA Lab, SEIEE, Shanghai, Peoples R China

[2] ByteDance Res, Beijing, Peoples R China

[3] ByteDance Res, Shanghai, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT XXXVII | 2025 / Vol. 15095

Keywords:

DOI:

10.1007/978-3-031-73113-6_11

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation; 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as the prompt, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eleven large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms the other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to the human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.


Pages: 179-195

Page count: 17

