VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Cited by: 0
Authors
Li, Shicheng [1 ]
Li, Lei [2 ]
Liu, Yi [1 ]
Ren, Shuhuai [1 ]
Liu, Yuanxin [1 ]
Gao, Rundong [1 ]
Sun, Xu [1 ]
Hou, Lu [3 ]
Affiliations
[1] Peking Univ, Sch Comp Sci, State Key Lab Multimedia Informat Proc, Beijing, Peoples R China
[2] Univ Hong Kong, Hong Kong, Peoples R China
[3] Huawei Noah's Ark Lab, Montreal, QC, Canada
Source
COMPUTER VISION - ECCV 2024, PT LXX | 2025 / Vol. 15128
Funding
National Natural Science Foundation of China
Keywords
Temporal understanding; Vision-language learning; Benchmark construction
DOI
10.1007/978-3-031-72897-6_19
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research. Our dataset is publicly available at https://github.com/lscpku/VITATECS.
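The counterfactual protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: `temporal_accuracy`, `score_fn`, and the bigram-overlap stand-in scorer are hypothetical, and a text string stands in for an actual video. The idea is that a model passes an example only when it scores the original caption above a counterfactual caption that differs solely in one temporal aspect, so a static (bag-of-words) shortcut cannot succeed.

```python
# Illustrative sketch (not from the paper): each example pairs a video with its
# original caption and a counterfactual caption differing only in a temporal
# aspect; the model is correct when the original caption scores higher.

def temporal_accuracy(examples, score_fn):
    """examples: iterable of (video, original_caption, counterfactual_caption).
    score_fn(video, caption) -> float: any video-text similarity model."""
    examples = list(examples)
    correct = sum(
        1 for video, orig, cf in examples
        if score_fn(video, orig) > score_fn(video, cf)
    )
    return correct / len(examples)

# Hypothetical stand-in scorer: order-sensitive bigram overlap. A bag-of-words
# scorer would tie on order-swapped counterfactuals, mirroring the static
# shortcut the dataset is designed to expose.
def bigram_score(video, caption):
    def bigrams(s):
        toks = s.split()
        return set(zip(toks, toks[1:]))
    return len(bigrams(video) & bigrams(caption))

examples = [
    ("man opens door then sits",   # textual stand-in for a video
     "man opens door then sits",   # original caption
     "man sits then opens door"),  # counterfactual: event order swapped
]
print(temporal_accuracy(examples, bigram_score))  # 1.0
```

With the order-swapped counterfactual, the bigram scorer prefers the original caption (overlap 4 vs. 1), so accuracy is 1.0; a word-overlap scorer would fail the same example, which is exactly the static-shortcut failure mode the dataset targets.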
Pages: 331-348 (18 pages)