A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Cited by: 42
Authors
Asif, Umar [1 ]
Bennamoun, Mohammed [2 ]
Sohel, Ferdous A. [3 ]
Affiliations
[1] IBM Res, Melbourne, Vic 3053, Australia
[2] Univ Western Australia, Crawley, WA 6009, Australia
[3] Murdoch Univ, Murdoch, WA 6150, Australia
Funding
Australian Research Council;
Keywords
RGB-D object recognition; 3D scene labeling; semantic segmentation; RECOGNITION; SCENE;
DOI
10.1109/TPAMI.2017.2747134
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multi-modal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three components: 1) spatial invariance-achieved by combining a spatial transformer network with a deep convolutional neural network to learn features which are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability-achieved by introducing Fisher encoding within the CNN architecture to learn features which have small inter-class similarities and large intra-class compactness; and 3) multi-modal hierarchical fusion-achieved by regularizing a multi-modal CNN architecture with semantic segmentation, where class probabilities are estimated at different hierarchical levels (i.e., image- and pixel-levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and on live video streams acquired from a Kinect sensor, show that our framework produces superior object and scene classification results compared to state-of-the-art methods.
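In the paper, the Fisher encoding is integrated into the CNN and learned end-to-end; the details are in the full text, not this record. As a rough, standalone illustration of what classic Fisher-vector encoding computes (gradients of a GMM's log-likelihood with respect to its means and diagonal covariances, in the style of Perronnin et al.), here is a minimal numpy sketch. All function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def fisher_vector(descriptors, means, covs, priors):
    """Simplified Fisher vector for a diagonal-covariance GMM.

    descriptors: (N, D) local features; means: (K, D); covs: (K, D)
    diagonal variances; priors: (K,) mixture weights.
    Returns a 2*K*D vector: gradients w.r.t. means and variances.
    """
    N, D = descriptors.shape
    K = means.shape[0]
    # soft-assignment (posterior) of each descriptor to each Gaussian
    diff = descriptors[:, None, :] - means[None, :, :]          # (N, K, D)
    log_p = (-0.5 * np.sum(diff**2 / covs[None], axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * covs), axis=1)
             + np.log(priors))                                  # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)                   # stabilize
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # gradient statistics, normalized by the mixture priors
    sigma = np.sqrt(covs)
    g_mu = (gamma[:, :, None] * diff / sigma[None]).sum(0)
    g_mu /= N * np.sqrt(priors)[:, None]
    g_sig = (gamma[:, :, None] * (diff**2 / covs[None] - 1)).sum(0)
    g_sig /= N * np.sqrt(2 * priors)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    # power- and L2-normalization, as is standard for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector summarizes how a set of local descriptors deviates from a generic GMM model, which is what gives Fisher encoding its discriminative power; the paper's contribution is learning such an encoding jointly with the CNN features rather than from hand-crafted descriptors.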
Pages: 2051-2065
Page count: 15