Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

Cited by: 1010

Authors:

Jing, Longlong ^{[1]}

Tian, Yingli ^{[2,3]}

Affiliations:

[1] CUNY, Grad Ctr, Dept Comp Sci, New York, NY 10016 USA

[2] CUNY City Coll, Dept Elect Engn, New York, NY 10031 USA

[3] CUNY, Grad Ctr, Dept Comp Sci, New York, NY 10031 USA

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2021 / Vol. 43 / No. 11

Funding:

US National Science Foundation;

Keywords:

Task analysis; Visualization; Videos; Training; Learning systems; Feature extraction; Annotations; Self-supervised learning; unsupervised learning; convolutional neural network; transfer learning; deep learning; CLASSIFICATION; MACHINE; SCENES;

DOI:

10.1109/TPAMI.2020.2992393

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes and lists a set of promising future directions for self-supervised visual feature learning.

References

Pages: 4037-4058

Page count: 22

186 entries in total

[1]

Abu-El-Haija S., 2016, arXiv

[2]

Achlioptas P., 2017, CoRR, abs/1707.02392

[3] Learning to See by Moving [J].

Agrawal, Pulkit ;

Carreira, Joao ;

Malik, Jitendra .

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :37-45

[4] Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition [J].

Ahsan, Unaiza ;

Madhok, Rishi ;

Essa, Irfan .

2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, :179-189

[5]

Alwassel Humam, 2019, ARXIV191112667

[6]

[Anonymous], 2016, INT C LEARN REPR ICL

[7]

[Anonymous], 2016, P NIPS

[8]

[Anonymous], 2017, IEEE I CONF COMP VIS, DOI 10.1109/ICCV.2017.244

[9] Look, Listen and Learn [J].

Arandjelovic, Relja ;

Zisserman, Andrew .

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :609-617

[10]

Arjovsky M, 2017, PR MACH LEARN RES, V70
