RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

Cited by: 15
Authors
Xiong, Zhitong [1 ,2 ]
Yuan, Yuan [1 ]
Wang, Qi [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; scene recognition; global and local features; multi-modal feature learning;
DOI
10.1109/ACCESS.2019.2932080
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks (CNNs) can learn high-level semantic features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability, so vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper, we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn global modal-specific and local modal-consistent features simultaneously; unlike existing approaches, local CNN features are considered when learning the modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed that can adaptively select important local features from high-level semantic CNN feature maps; it is more efficient and effective than object-detection- and dense-patch-sampling-based methods; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for training the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities without extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D dataset and the NYU Depth version 2 (NYUD v2) dataset.
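The abstract's overall pipeline can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the scoring rule (activation magnitude per spatial location) and the function names are assumptions standing in for the learned KFS module, and the losses are omitted. It only shows the data flow of selecting top-k local features from each modality's feature map and concatenating them with global (pooled) modal-specific features.

```python
import numpy as np

def select_key_features(feature_map, k):
    """Hypothetical stand-in for Key Feature Selection (KFS):
    score each spatial location of a (C, H, W) CNN feature map by
    its activation magnitude and keep the k highest-scoring local
    feature vectors. The real module learns this selection."""
    c, h, w = feature_map.shape
    locals_ = feature_map.reshape(c, h * w).T          # (H*W, C) local features
    scores = np.linalg.norm(locals_, axis=1)           # one score per location
    top_idx = np.argsort(scores)[::-1][:k]             # indices of top-k locations
    return locals_[top_idx]                            # (k, C)

def fuse_global_local(rgb_map, depth_map, k=4):
    """Concatenate global modal-specific features (average-pooled)
    with selected local features from both modalities, mirroring the
    final feature concatenation described in the abstract."""
    g_rgb = rgb_map.mean(axis=(1, 2))                  # global RGB feature (C,)
    g_depth = depth_map.mean(axis=(1, 2))              # global depth feature (C,)
    l_rgb = select_key_features(rgb_map, k).ravel()    # k local RGB features
    l_depth = select_key_features(depth_map, k).ravel()
    return np.concatenate([g_rgb, g_depth, l_rgb, l_depth])
```

For feature maps of shape (C, H, W), the fused vector has length 2*C + 2*k*C; in practice this vector would feed a scene classifier.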
Pages: 106739 / 106747
Page count: 9
Related Papers
50 records in total
  • [41] Traffic Sign Recognition via Multi-Modal Tree-Structure Embedded Multi-Task Learning
    Lu, Xiao
    Wang, Yaonan
    Zhou, Xuanyu
    Zhang, Zhenjun
    Ling, Zhigang
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2017, 18 (04) : 960 - 972
  • [42] Semi-supervised Grounding Alignment for Multi-modal Feature Learning
    Chou, Shih-Han
    Fan, Zicong
    Little, James J.
    Sigal, Leonid
    2022 19TH CONFERENCE ON ROBOTS AND VISION (CRV 2022), 2022, : 48 - 57
  • [43] A multi-scale descriptor for real time RGB-D hand gesture recognition
    Huang, Yao
    Yang, Jianyu
    PATTERN RECOGNITION LETTERS, 2021, 144 : 97 - 104
  • [44] RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion
    Peng, Yanbin
    Zhai, Zhinian
    Feng, Mingkun
    IEEE ACCESS, 2024, 12 : 45134 - 45146
  • [46] Human Gait Recognition Based on Frontal-View Walking Sequences Using Multi-modal Feature Representations and Learning
    Deng, Muqing
    Zhong, Zebang
    Zou, Yi
    Wang, Yanjiao
    Wang, Kaiwei
    Liao, Junrong
    NEURAL PROCESSING LETTERS, 2024, 56 (02)
  • [47] FFMT: Unsupervised RGB-D Point Cloud Registration via Fusion Feature Matching with Transformer
    Qiu, Jiacun
    Han, Zhenqi
    Liu, Lizhaung
    Zhang, Jialu
    APPLIED SCIENCES-BASEL, 2025, 15 (05):
  • [48] Multi-level cross-modal interaction network for RGB-D salient object detection
    Huang, Zhou
    Chen, Huai-Xin
    Zhou, Tao
    Yang, Yun-Zhi
    Liu, Bi-Yuan
    NEUROCOMPUTING, 2021, 452 : 200 - 211
  • [49] Indoor scene understanding via RGB-D image segmentation employing depth-based CNN and CRFs
    Wei Li
    Junhua Gu
    Yongfeng Dong
    Yao Dong
    Jungong Han
    Multimedia Tools and Applications, 2020, 79 : 35475 - 35489
  • [50] Attention-guided cross-modal multiple feature aggregation network for RGB-D salient object detection
    Chen, Bojian
    Wu, Wenbin
    Li, Zhezhou
    Han, Tengfei
    Chen, Zhuolei
    Zhang, Weihao
    ELECTRONIC RESEARCH ARCHIVE, 2024, 32 (01): : 643 - 669