Self-Supervised Visual Descriptor Learning for Dense Correspondence

Cited by: 109
Authors
Schmidt, Tanner [1 ]
Newcombe, Richard [2 ]
Fox, Dieter [1 ]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98105 USA
[2] Oculus Res, Redmond, WA 98053 USA
Keywords
Recognition; RGB-D perception; visual learning
DOI
10.1109/LRA.2016.2634089
Chinese Library Classification
TP24 [Robotics]
Subject Classification Codes
080202; 1405
Abstract
Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for learning features from raw data. The drawback of the latter approach is that a vast amount of (typically labeled) training data is required for learning. This paper advocates a new approach to learning visual descriptors for dense correspondence estimation in which we harness the power of a strong three-dimensional generative model to automatically label correspondences in RGB-D video data. A fully convolutional network is trained using a contrastive loss to produce viewpoint- and lighting-invariant descriptors. As a proof of concept, we collected two datasets: The first depicts the upper torso and head of the same person in widely varied settings, and the second depicts an office as seen on multiple days with objects rearranged within. Our datasets focus on revisitation of the same objects and environments, and we show that by training the CNN only from local tracking data, our learned visual descriptor generalizes toward identifying unlabeled correspondences across videos. We furthermore show that our approach to descriptor learning can be used to achieve state-of-the-art single-frame localization results on the MSR 7-scenes dataset without using any labels identifying correspondences between separate videos of the same scenes at training time.
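For reference, the training objective named in the abstract is the contrastive loss (Hadsell et al., 2006) applied to pairs of pixel descriptors: matching pairs are pulled together, non-matching pairs are pushed apart up to a margin. The sketch below is a minimal numpy illustration of that loss, not the authors' implementation; the function name, margin value, and the way matching/non-matching pairs are sampled are assumptions for illustration.

    import numpy as np

    def pixelwise_contrastive_loss(desc_a, desc_b, is_match, margin=0.5):
        # Contrastive loss over sampled pixel-descriptor pairs.
        # desc_a, desc_b: (N, D) descriptors sampled at N pixel pairs
        #                 from two frames.
        # is_match:       (N,) boolean; True where the 3D model labels the
        #                 pair as a correspondence, False for non-matches.
        # margin:         non-matches are penalized only inside this distance.
        dist = np.linalg.norm(desc_a - desc_b, axis=1)
        match_loss = dist ** 2                               # pull matches together
        nonmatch_loss = np.maximum(0.0, margin - dist) ** 2  # push non-matches past the margin
        return np.mean(np.where(is_match, match_loss, nonmatch_loss))

    # Toy usage: 4 pairs of 3-D descriptors, the first two labeled as matches.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(4, 3))
    b = rng.normal(size=(4, 3))
    labels = np.array([True, True, False, False])
    print(pixelwise_contrastive_loss(a, b, labels))

In the paper's setting the labels come for free: a dense 3D reconstruction of the RGB-D video associates pixels across frames, so no human annotation of correspondences is needed.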
Pages: 420-427
Page count: 8