RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

Cited by: 15
Authors
Xiong, Zhitong [1 ,2 ]
Yuan, Yuan [1 ]
Wang, Qi [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; scene recognition; global and local features; multi-modal feature learning;
DOI
10.1109/ACCESS.2019.2932080
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks (CNNs) can learn high-level semantic features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability, so vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper, we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn global modal-specific and local modal-consistent features simultaneously; unlike existing approaches, local CNN features are considered when learning the modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed that can adaptively select important local features from high-level semantic CNN feature maps; it is more efficient and effective than object-detection- and dense-patch-sampling-based methods; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for training the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities without extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D dataset and the NYU Depth version 2 (NYUD v2) dataset.
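The abstract's overall pipeline can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the scoring rule (activation magnitude per spatial location) and the function names are assumptions standing in for the learned KFS module, and the losses are omitted. It only shows the data flow of selecting top-k local features from each modality's feature map and concatenating them with global (pooled) modal-specific features.

```python
import numpy as np

def select_key_features(feature_map, k):
    """Hypothetical stand-in for Key Feature Selection (KFS):
    score each spatial location of a (C, H, W) CNN feature map by
    its activation magnitude and keep the k highest-scoring local
    feature vectors. The real module learns this selection."""
    c, h, w = feature_map.shape
    locals_ = feature_map.reshape(c, h * w).T          # (H*W, C) local features
    scores = np.linalg.norm(locals_, axis=1)           # one score per location
    top_idx = np.argsort(scores)[::-1][:k]             # indices of top-k locations
    return locals_[top_idx]                            # (k, C)

def fuse_global_local(rgb_map, depth_map, k=4):
    """Concatenate global modal-specific features (average-pooled)
    with selected local features from both modalities, mirroring the
    final feature concatenation described in the abstract."""
    g_rgb = rgb_map.mean(axis=(1, 2))                  # global RGB feature (C,)
    g_depth = depth_map.mean(axis=(1, 2))              # global depth feature (C,)
    l_rgb = select_key_features(rgb_map, k).ravel()    # k local RGB features
    l_depth = select_key_features(depth_map, k).ravel()
    return np.concatenate([g_rgb, g_depth, l_rgb, l_depth])
```

For feature maps of shape (C, H, W), the fused vector has length 2*C + 2*k*C; in practice this vector would feed a scene classifier.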
Pages: 106739 / 106747
Page count: 9
Related Papers
50 records in total
  • [41] Traffic Sign Recognition via Multi-Modal Tree-Structure Embedded Multi-Task Learning
    Lu, Xiao
    Wang, Yaonan
    Zhou, Xuanyu
    Zhang, Zhenjun
    Ling, Zhigang
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2017, 18 (04) : 960 - 972
  • [42] Semi-supervised Grounding Alignment for Multi-modal Feature Learning
    Chou, Shih-Han
    Fan, Zicong
    Little, James J.
    Sigal, Leonid
    2022 19TH CONFERENCE ON ROBOTS AND VISION (CRV 2022), 2022, : 48 - 57
  • [43] A multi-scale descriptor for real time RGB-D hand gesture recognition
    Huang, Yao
    Yang, Jianyu
    PATTERN RECOGNITION LETTERS, 2021, 144 : 97 - 104
  • [44] RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion
    Peng, Yanbin
    Zhai, Zhinian
    Feng, Mingkun
    IEEE ACCESS, 2024, 12 : 45134 - 45146
  • [46] Human Gait Recognition Based on Frontal-View Walking Sequences Using Multi-modal Feature Representations and Learning
    Deng, Muqing
    Zhong, Zebang
    Zou, Yi
    Wang, Yanjiao
    Wang, Kaiwei
    Liao, Junrong
    NEURAL PROCESSING LETTERS, 2024, 56 (02)
  • [47] FFMT: Unsupervised RGB-D Point Cloud Registration via Fusion Feature Matching with Transformer
    Qiu, Jiacun
    Han, Zhenqi
    Liu, Lizhaung
    Zhang, Jialu
    APPLIED SCIENCES-BASEL, 2025, 15 (05):
  • [48] Multi-level cross-modal interaction network for RGB-D salient object detection
    Huang, Zhou
    Chen, Huai-Xin
    Zhou, Tao
    Yang, Yun-Zhi
    Liu, Bi-Yuan
    NEUROCOMPUTING, 2021, 452 : 200 - 211
  • [49] Indoor scene understanding via RGB-D image segmentation employing depth-based CNN and CRFs
    Wei Li
    Junhua Gu
    Yongfeng Dong
    Yao Dong
    Jungong Han
    Multimedia Tools and Applications, 2020, 79 : 35475 - 35489
  • [50] Attention-guided cross-modal multiple feature aggregation network for RGB-D salient object detection
    Chen, Bojian
    Wu, Wenbin
    Li, Zhezhou
    Han, Tengfei
    Chen, Zhuolei
    Zhang, Weihao
    ELECTRONIC RESEARCH ARCHIVE, 2024, 32 (01): : 643 - 669