VS-Net: Voting with Segmentation for Visual Localization

Cited by: 24
Authors
Huang, Zhaoyang [1 ,2 ,4 ]
Zhou, Han [1 ,4 ]
Li, Yijin [1 ,4 ]
Yang, Bangbang [1 ,4 ]
Xu, Yan [2 ]
Zhou, Xiaowei [1 ,4 ]
Bao, Hujun [1 ,4 ]
Zhang, Guofeng [1 ,4 ]
Li, Hongsheng [2 ,3 ]
Affiliations
[1] Zhejiang Univ, State Key Lab Cad & CG, Hangzhou, Peoples R China
[2] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[3] Xidian Univ, Sch CST, Xian, Peoples R China
[4] ZJU SenseTime Joint Lab 3D Vis, Hangzhou, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00604
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Visual localization is of great importance in robotics and computer vision. Recently, scene coordinate regression based methods have shown good performance in visual localization in small static scenes. However, these methods still estimate camera poses from many inferior scene coordinates. To address this problem, we propose a novel visual localization framework that establishes 2D-to-3D correspondences between the query image and the 3D map with a series of learnable scene-specific landmarks. In the landmark generation stage, the 3D surfaces of the target scene are over-segmented into mosaic patches whose centers are regarded as the scene-specific landmarks. To robustly and accurately recover the scene-specific landmarks, we propose the Voting with Segmentation Network (VS-Net), which segments the pixels into different landmark patches with a segmentation branch and estimates the landmark locations within each patch with a landmark location voting branch. Since the number of landmarks in a scene may reach up to 5000, training a segmentation network with such a large number of classes is costly in both computation and memory for the commonly used cross-entropy loss. We propose a novel prototype-based triplet loss with hard negative mining, which can efficiently train semantic segmentation networks with a large number of labels. Our proposed VS-Net is extensively tested on multiple public benchmarks and outperforms state-of-the-art visual localization methods. Code and models are available at https://github.com/zju3dv/VS-net.
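The prototype-based triplet loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, distance metric (Euclidean), and margin value are assumptions; the idea shown is only the general scheme the abstract names, i.e. each landmark class keeps one prototype embedding, each pixel embedding is pulled toward its own class prototype, and the hardest (nearest) prototype of any other class is pushed away by a margin.

```python
import numpy as np

def prototype_triplet_loss(embeddings, labels, prototypes, margin=1.0):
    """Sketch of a prototype-based triplet loss with hard negative mining.

    embeddings: (N, D) pixel embeddings
    labels:     (N,)   landmark-patch class index per pixel
    prototypes: (K, D) one learnable prototype per landmark class
    """
    n = len(labels)
    # Distance from every embedding to every prototype: (N, K).
    d = np.linalg.norm(embeddings[:, None, :] - prototypes[None, :, :], axis=2)
    # Positive distance: each embedding to its own class prototype.
    d_pos = d[np.arange(n), labels]
    # Hard negative mining: nearest prototype of any *other* class.
    d_masked = d.copy()
    d_masked[np.arange(n), labels] = np.inf
    d_neg = d_masked.min(axis=1)
    # Hinge: keep positives closer than the hardest negative by the margin.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

Because the loss only compares each pixel against its own prototype and a single mined negative prototype, its cost grows with the number of prototypes K rather than requiring a K-way softmax over thousands of classes, which is the efficiency argument the abstract makes.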
Pages: 6097-6107
Page count: 11