Revisiting the "Video" in Video-Language Understanding

Cited by: 61
Authors
Buch, Shyamal [1 ]
Eyzaguirre, Cristobal [1 ]
Gaidon, Adrien [2 ]
Wu, Jiajun [1 ]
Li Fei-Fei [1 ]
Niebles, Juan Carlos [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Stanford, CA USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.00293
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
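The core idea of the atemporal probe, as described above, is to score each frame of a video independently with frozen image-language features and commit to a single frame for the downstream task, thereby bounding what image-level understanding alone can achieve. The following is a minimal illustrative sketch of that selection step, not the authors' implementation; the array shapes, the random features standing in for a frozen CLIP-like encoder, and the linear scorer `W` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 candidate frames, 512-dim frozen image-language embeddings.
T, d = 8, 512
frame_embs = rng.normal(size=(T, d))  # stand-in for frozen per-frame features
text_emb = rng.normal(size=(d,))      # stand-in for a frozen text feature

# ATP-style selection (sketch): each frame is scored *independently*, with no
# access to frame order or to the other frames, so the probe is atemporal
# by construction.
W = rng.normal(size=(d,)) * 0.01      # toy scorer weights (would be learned)
logits = frame_embs @ W               # one score per frame
selected = frame_embs[np.argmax(logits)]  # commit to a single frame embedding

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The downstream task (e.g. text-to-video retrieval) is then scored from this
# one frame only, giving an image-level baseline bound.
score = cosine(selected, text_emb)
```

Because the selector never sees temporal context, any benchmark that ATP solves well is, by this argument, answerable from a single well-chosen frame.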
Pages: 2907-2917
Page count: 11