Revisiting the "Video" in Video-Language Understanding

Cited by: 75
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00293
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
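To make the atemporal-probe idea concrete, here is a minimal PyTorch sketch of a single-frame selector over frozen image-language features. It is an illustration inferred from the abstract, not the authors' released implementation: the class name AtemporalProbe, the dimensions, the transformer depth, and the Gumbel-softmax selection mechanism are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbe(nn.Module):
    # Scores each (frozen) per-frame image-language embedding and selects a
    # single frame per video. No positional encodings are used, so temporal
    # order is discarded by construction. Hyperparameters are illustrative.
    def __init__(self, embed_dim=512, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, embed_dim), e.g. frozen CLIP
        # image embeddings for a sparse set of sampled frames.
        logits = self.scorer(self.encoder(frame_feats)).squeeze(-1)
        # Hard one-hot selection kept differentiable via Gumbel-softmax;
        # at inference this amounts to an argmax over candidate frames.
        weights = F.gumbel_softmax(logits, hard=True, dim=-1)
        return (weights.unsqueeze(-1) * frame_feats).sum(dim=1)

# The selected single-frame embedding is matched against a text embedding
# exactly as an image-language model would be, so the resulting accuracy
# bounds what image-level (atemporal) understanding can achieve on the task.
probe = AtemporalProbe()
frames = torch.randn(4, 16, 512)   # 16 candidate frames per video
text = torch.randn(4, 512)         # pooled text embeddings
scores = F.cosine_similarity(probe(frames), text, dim=-1)
```

Such a probe also suggests the dataset technique the abstract describes: examples that a single-frame selector answers correctly are plausibly image-solvable, while its failures concentrate the temporally challenging data.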
Pages: 2907-2917
Page count: 11