Lightweight cross-modal transformer for RGB-D salient object detection

Cited by: 0
Authors
Huang, Nianchang [1, 2, 3]
Yang, Yang [3]
Zhang, Qiang [1, 2, 3]
Han, Jungong [4]
Huang, Jin [3]
Affiliations
[1] Xidian Univ, Key Lab Elect Equipment Struct Design, Minist Educ, Xian 710071, Shaanxi, Peoples R China
[2] Xidian Univ, State Key Lab Electromech Integrated Mfg High Perf, Xian 710071, Shaanxi, Peoples R China
[3] Xidian Univ, Ctr Complex Syst, Sch Mechanoelect Engn, Xian 710071, Shaanxi, Peoples R China
[4] Univ Sheffield, Dept Comp Sci, England, England
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
RGB-D salient object detection; Lightweight model; Cross-modal transformer
DOI
10.1016/j.cviu.2024.104194
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, Transformer-based RGB-D salient object detection (SOD) models have pushed performance to a new level. However, these gains come at the cost of substantial resource consumption, including memory and power, which hinders real-life applications. To remedy this situation, this paper presents a novel lightweight cross-modal Transformer (LCT) for RGB-D SOD. Specifically, LCT first reduces its parameters and computational cost by employing a middle-level feature fusion structure and taking a lightweight Transformer as the backbone. Then, with the aid of Transformers, it compensates for the resulting performance degradation by effectively capturing the cross-modal and cross-level complementary information from the multi-modal input images. To this end, a cross-modal enhancement and fusion module (CEFM) with a lightweight channel-wise cross attention block (LCCAB) is designed to capture the cross-modal complementary information effectively but at lower cost. In addition, a bi-directional multi-level feature interaction module (Bi-MFIM) with a lightweight spatial-wise cross attention block (LSCAB) is designed to capture the cross-level complementary context information. By virtue of CEFM and Bi-MFIM, the performance degradation caused by parameter reduction is well compensated. As a result, the proposed model has only 2.8M parameters with 7.6G FLOPs and runs at 66 FPS. Furthermore, experimental results on several benchmark datasets show that the proposed model achieves competitive or even better results than other models. The code will be released at https://github.com/nexiakele/lightweight-cross-modalTransformer-LCT-for-RGB-D-SOD.
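To illustrate why attention computed across channels can be cheaper than attention computed across spatial positions, the following is a minimal sketch of a generic channel-wise cross attention step between RGB and depth features. It is not the LCCAB described in the abstract: the function names `channel_cross_attention` and `attention_map_sizes`, the residual connection, and the omission of learned projections are all assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_cross_attention(rgb, depth):
    """Generic channel-wise cross attention (illustrative, hypothetical).

    rgb, depth: lists of C channels, each a flat list of N spatial values.
    Each RGB channel acts as a query over the depth channels, so the
    attention map is C x C rather than N x N. Learned query/key/value
    projections are omitted here (an assumption for brevity).
    """
    n = len(rgb[0])
    scale = 1.0 / math.sqrt(n)
    out = []
    for q in rgb:
        # Similarity between this RGB channel and every depth channel.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in depth]
        weights = softmax(scores)
        # Weighted mix of depth channels at every spatial position.
        mixed = [sum(w * ch[i] for w, ch in zip(weights, depth)) for i in range(n)]
        # Residual connection: enhance the RGB channel with depth cues.
        out.append([x + y for x, y in zip(q, mixed)])
    return out


def attention_map_sizes(channels, height, width):
    """Number of attention-map entries for channel-wise vs spatial-wise attention."""
    n = height * width
    return {"channel_wise": channels * channels, "spatial_wise": n * n}


if __name__ == "__main__":
    rgb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    depth = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    print(channel_cross_attention(rgb, depth))
    print(attention_map_sizes(64, 32, 32))
```

In this sketch the attention map grows with the square of the channel count rather than the square of the number of spatial positions, which is one common motivation for channel-wise designs when feature maps are spatially large. Whether the paper's blocks follow this exact formulation is not stated in the abstract.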
Pages: 10