Generic 3D Representation via Pose Estimation and Matching

Cited by: 34

Authors:

Zamir, Amir R. [1]

Wekel, Tilman [1]

Agrawal, Pulkit [2]

Wei, Colin [1]

Malik, Jitendra [2]

Savarese, Silvio [1]

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Univ Calif Berkeley, Berkeley, CA 94720 USA

Source:

COMPUTER VISION - ECCV 2016, PT III | 2016 / Vol. 9907

Keywords:

Generic vision; Representation; Descriptor learning; Pose estimation; Wide-baseline matching; Street view;

DOI:

10.1007/978-3-319-46487-9_33

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose estimation). In the context of the core supervised tasks, we demonstrate that our representation achieves state-of-the-art wide baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learnt features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy on both supervised tasks is comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.

Citation

Pages: 535-553

Page count: 19

