A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Cited by: 153
Authors
Feichtenhofer, Christoph [1]
Fan, Haoqi [1]
Xiong, Bo [1]
Girshick, Ross [1]
He, Kaiming [1]
Affiliations
[1] Facebook AI Res FAIR, Paris, France
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00331
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart.
Pages: 3298-3308
Number of pages: 11