A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Cited by: 139
Authors
Feichtenhofer, Christoph [1]
Fan, Haoqi [1]
Xiong, Bo [1]
Girshick, Ross [1]
He, Kaiming [1]
Affiliations
[1] Facebook AI Res FAIR, Paris, France
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00331
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart.
Pages: 3298-3308
Page count: 11