A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Cited by: 42
Authors
Asif, Umar [1 ]
Bennamoun, Mohammed [2 ]
Sohel, Ferdous A. [3 ]
Affiliations
[1] IBM Res, Melbourne, Vic 3053, Australia
[2] Univ Western Australia, Crawley, WA 6009, Australia
[3] Murdoch Univ, Murdoch, WA 6150, Australia
Funding
Australian Research Council;
Keywords
RGB-D object recognition; 3D scene labeling; semantic segmentation; RECOGNITION; SCENE;
DOI
10.1109/TPAMI.2017.2747134
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multi-modal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three components: 1) spatial invariance-achieved by combining a spatial transformer network with a deep convolutional neural network to learn features which are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability-achieved by introducing Fisher encoding within the CNN architecture to learn features which have small inter-class similarities and large intra-class compactness; and 3) multi-modal hierarchical fusion-achieved by regularizing a multi-modal CNN architecture with semantic segmentation, where class probabilities are estimated at different hierarchical levels (i.e., image- and pixel-levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams acquired from a Kinect sensor, show that our framework produces superior object and scene classification results compared to state-of-the-art methods.
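In the paper, the Fisher encoding is integrated into the CNN and learned end-to-end; the details are in the full text, not this record. As a rough, standalone illustration of what classic Fisher-vector encoding computes (gradients of a GMM's log-likelihood with respect to its means and diagonal covariances, in the style of Perronnin et al.), here is a minimal numpy sketch. All function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def fisher_vector(descriptors, means, covs, priors):
    """Simplified Fisher vector for a diagonal-covariance GMM.

    descriptors: (N, D) local features; means: (K, D); covs: (K, D)
    diagonal variances; priors: (K,) mixture weights.
    Returns a 2*K*D vector: gradients w.r.t. means and variances.
    """
    N, D = descriptors.shape
    K = means.shape[0]
    # soft-assignment (posterior) of each descriptor to each Gaussian
    diff = descriptors[:, None, :] - means[None, :, :]          # (N, K, D)
    log_p = (-0.5 * np.sum(diff**2 / covs[None], axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * covs), axis=1)
             + np.log(priors))                                  # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)                   # stabilize
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # gradient statistics, normalized by the mixture priors
    sigma = np.sqrt(covs)
    g_mu = (gamma[:, :, None] * diff / sigma[None]).sum(0)
    g_mu /= N * np.sqrt(priors)[:, None]
    g_sig = (gamma[:, :, None] * (diff**2 / covs[None] - 1)).sum(0)
    g_sig /= N * np.sqrt(2 * priors)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    # power- and L2-normalization, as is standard for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector summarizes how a set of local descriptors deviates from a generic GMM model, which is what gives Fisher encoding its discriminative power; the paper's contribution is learning such an encoding jointly with the CNN features rather than from hand-crafted descriptors.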
Pages: 2051-2065
Page count: 15