Revisiting the "Video" in Video-Language Understanding

Cited by: 75
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00293
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
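To make the atemporal-probe idea concrete, here is a minimal PyTorch sketch of a single-frame selector over frozen image-language features. It is an illustration inferred from the abstract, not the authors' released implementation: the class name AtemporalProbe, the dimensions, the transformer depth, and the Gumbel-softmax selection mechanism are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbe(nn.Module):
    # Scores each (frozen) per-frame image-language embedding and selects a
    # single frame per video. No positional encodings are used, so temporal
    # order is discarded by construction. Hyperparameters are illustrative.
    def __init__(self, embed_dim=512, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, embed_dim), e.g. frozen CLIP
        # image embeddings for a sparse set of sampled frames.
        logits = self.scorer(self.encoder(frame_feats)).squeeze(-1)
        # Hard one-hot selection kept differentiable via Gumbel-softmax;
        # at inference this amounts to an argmax over candidate frames.
        weights = F.gumbel_softmax(logits, hard=True, dim=-1)
        return (weights.unsqueeze(-1) * frame_feats).sum(dim=1)

# The selected single-frame embedding is matched against a text embedding
# exactly as an image-language model would be, so the resulting accuracy
# bounds what image-level (atemporal) understanding can achieve on the task.
probe = AtemporalProbe()
frames = torch.randn(4, 16, 512)   # 16 candidate frames per video
text = torch.randn(4, 512)         # pooled text embeddings
scores = F.cosine_similarity(probe(frames), text, dim=-1)
```

Such a probe also suggests the dataset technique the abstract describes: examples that a single-frame selector answers correctly are plausibly image-solvable, while its failures concentrate the temporally challenging data.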
Pages: 2907-2917
Page count: 11