PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

Cited: 3
Authors
Xia, Chenxing [1 ,2 ,3 ]
Duan, Xiuzhen [1 ]
Gao, Xiuju [4 ]
Ge, Bin [1 ]
Li, Kuan-Ching [5 ]
Fang, Xianjin [1 ,6 ]
Zhang, Yan [7 ]
Yang, Ke [2 ]
Affiliations
[1] Anhui Univ Sci & Technol, Coll Comp Sci & Engn, Huainan 232001, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Energy, Hefei, Peoples R China
[3] Anhui Purvar Bigdata Technol Co Ltd, Huainan 232001, Peoples R China
[4] Anhui Univ Sci & Technol, Coll Elect & Informat Engn, Huainan 232001, Peoples R China
[5] Providence Univ, Dept Comp Sci & Informat Engn, Taichung, Taiwan
[6] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei, Peoples R China
[7] Anhui Univ, Sch Elect & Informat Engn, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Monocular depth estimation; Hierarchical interaction fusion; Transformer; CNNs; Attention; NETWORK; SHAPE;
DOI
10.1007/s11063-024-11524-0
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Monocular depth estimation (MDE) has progressed substantially with the development of convolutional neural networks (CNNs). However, CNN-based approaches are inherently short-sighted, as their reasoning relies on insufficient local features. To this end, we propose PCTDepth, an effective parallel CNN-and-Transformer model for MDE via dual attention. Specifically, we extract features with a two-stream backbone, in which ResNet captures local detail features and Swin Transformer captures global long-range dependencies. Furthermore, a hierarchical fusion module (HFM) is designed to actively exchange beneficial information between the two representations during intermediate fusion, so that each complements the other. Finally, a dual attention module is applied to each fused feature in the decoder stage, improving accuracy by enhancing inter-channel correlations and focusing on relevant spatial locations. Comprehensive experiments on the KITTI dataset demonstrate that the proposed model consistently outperforms other state-of-the-art methods.
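The record does not spell out the internals of the dual attention module. As a rough illustration of the general idea the abstract describes (enhancing inter-channel correlations, then focusing on relevant spatial locations), here is a minimal NumPy sketch of sequential channel-then-spatial gating; the function names are hypothetical and the gates are computed directly from the feature map, whereas a real implementation would use learned weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Weight each channel by a gate derived from its global average (SE-style).
    x: feature map of shape (C, H, W)."""
    z = x.reshape(x.shape[0], -1).mean(axis=1)   # global average pool -> (C,)
    w = sigmoid(z)                               # per-channel gate in (0, 1)
    return x * w[:, None, None]

def spatial_attention(x):
    """Weight each spatial location by a gate pooled across channels."""
    avg = x.mean(axis=0)                         # (H, W)
    mx = x.max(axis=0)                           # (H, W)
    m = sigmoid(avg + mx)                        # per-location gate in (0, 1)
    return x * m[None, :, :]

def dual_attention(x):
    """Hypothetical dual attention: channel gating followed by spatial gating."""
    return spatial_attention(channel_attention(x))

rng = np.random.default_rng(0)
feat = rng.random((8, 4, 4))                     # a fused decoder feature
out = dual_attention(feat)
```

Both gates lie in (0, 1), so the output keeps the input's shape while attenuating less informative channels and locations.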
Pages: 21