AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

Cited by: 0
Authors
Chen, Xiuyuan [1 ]
Lin, Yuan [3 ]
Zhang, Yuchen [2 ]
Huang, Weiran [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Qing Yuan Res Inst, MIFA Lab, SEIEE, Shanghai, Peoples R China
[2] ByteDance Res, Beijing, Peoples R China
[3] ByteDance Res, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2024, PT XXXVII | 2025 / Vol. 15095
DOI
10.1007/978-3-031-73113-6_11
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation; 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompt, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9% - 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eleven large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
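The rule-based evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code (see the repository linked above for that). The `Instance` schema, the prompt wording, the one-word verdict format, and all function names are assumptions made for illustration, and the call to the judge model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One video-question pair with its own evaluation rules (hypothetical schema)."""
    question: str
    rules: str


def build_eval_prompt(instance: Instance, response: str) -> str:
    """Assemble a judge prompt from the instance-specific rules and a model response.

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        "You are grading an answer to a question about a video.\n"
        f"Question: {instance.question}\n"
        f"Evaluation rules: {instance.rules}\n"
        f"Answer to grade: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's reply to a boolean; anything other than 'correct' counts as wrong."""
    return judge_reply.strip().lower().rstrip(".") == "correct"


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of responses judged correct (0.0 for an empty list)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


if __name__ == "__main__":
    # Made-up example instance; the real benchmark annotates rules per instance.
    inst = Instance(
        question="What does the person do after opening the door?",
        rules="Correct only if the answer mentions picking up the package.",
    )
    print(build_eval_prompt(inst, "They pick up a package."))
    # Replies would come from the judge model; these are hard-coded placeholders.
    replies = ["correct", "Incorrect.", "Correct."]
    print(accuracy([parse_verdict(r) for r in replies]))
```

Treating any reply other than "correct" as a failure is a conservative choice: a malformed judge reply lowers the reported accuracy instead of inflating it.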
Pages: 179 - 195
Page count: 17
Related papers
38 in total
  • [1] Unifying the Video and Question Attentions for Open-Ended Video Question Answering
    Xue, Hongyang
    Zhao, Zhou
    Cai, Deng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (12) : 5656 - 5666
  • [2] Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
    Ko, Dohwan
    Lee, Ji Soo
    Choi, Miso
    Chu, Jaewon
    Park, Jihwan
    Kim, Hyunwoo J.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3078 - 3089
  • [3] Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering
    Jin, Yao
    Niu, Guocheng
    Xiao, Xinyan
    Zhang, Jian
    Peng, Xi
    Yu, Jun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023, : 8141 - 8149
  • [4] Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
    Luo, Haozheng
    Qin, Ruiyang
    Xu, Chenwei
    Ye, Guo
    Luo, Zening
    2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, 2023, : 363 - 369
  • [5] Coarse to Fine Frame Selection for Online Open-ended Video Question Answering
    Nuthalapati, Sai Vidyaranya
    Tunga, Anirudh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 353 - 361
  • [6] BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
    Bhuyan, Md. Shalha Mucha
    Hossain, Eftekhar
    Sathi, Khaleda Akhter
    Hossain, Md. Azad
    Dewan, M. Ali Akber
    IEEE ACCESS, 2025, 13 : 27570 - 27586
  • [7] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
    van Sonsbeek, Tom
    Derakhshani, Mohammad Mahdi
    Najdenkoska, Ivona
    Snoek, Cees G. M.
    Worring, Marcel
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 726 - 736
  • [8] Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks
    Zhao, Zhou
    Xiao, Shuwen
    Song, Zehan
    Lu, Chujie
    Xiao, Jun
    Zhuang, Yueting
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 3859 - 3870
  • [9] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
    Ko, Dohwan
    Lee, Ji Soo
    Kang, Wooyoung
    Roh, Byungseok
    Kim, Hyunwoo J.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4300 - 4316
  • [10] Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
    Zhao, Zhou
    Zhang, Zhu
    Xiao, Shuwen
    Yu, Zhou
    Yu, Jun
    Cai, Deng
    Wu, Fei
    Zhuang, Yueting
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 3683 - 3689