A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Cited by: 153

Authors:

Feichtenhofer, Christoph [1]

Fan, Haoqi [1]

Xiong, Bo [1]

Girshick, Ross [1]

He, Kaiming [1]

Affiliations:

[1] Facebook AI Res FAIR, Paris, France

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

Keywords:

DOI:

10.1109/CVPR46437.2021.00331

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart.
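The objective described in the abstract (features that persist over time within the same video) can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it uses an InfoNCE-style contrastive loss as one possible instantiation, treats clips from the same video as positives, and all function names, the temperature value, and the toy features are hypothetical.

```python
import math
import random


def sample_clip_starts(video_len_s, clip_len_s, num_clips, rng):
    """Pick start times (in seconds) for clips drawn from one video.

    Hypothetical helper: clips may lie far apart in time, which is
    what makes them 'temporally persistent' positives.
    """
    max_start = video_len_s - clip_len_s
    return [rng.uniform(0.0, max_start) for _ in range(num_clips)]


def l2_normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def persistence_loss(features, video_ids, temperature=0.1):
    """InfoNCE-style loss (an assumed instantiation).

    Clips that share a video id are positives; all other clips in
    the batch act as negatives.
    """
    feats = [l2_normalize(f) for f in features]
    total, count = 0.0, 0
    for i, query in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        logits = [dot(query, feats[j]) / temperature for j in others]
        # Numerically stable log-sum-exp over all other clips.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        for logit, j in zip(logits, others):
            if video_ids[j] == video_ids[i]:
                total += log_denom - logit
                count += 1
    return total / max(count, 1)


# Toy usage: two clips per video, two videos.
rng = random.Random(0)
starts = sample_clip_starts(video_len_s=60.0, clip_len_s=2.0, num_clips=2, rng=rng)
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss = persistence_loss(toy_features, video_ids=[0, 0, 1, 1])
```

In this sketch the loss is small when clips from the same video map to similar features and large when they do not, which is the behavior the abstract's objective is meant to encourage.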

Citation

Pages: 3298-3308

Page count: 11

