Single-view 3D reconstruction via dual attention

Cited by: 0
Authors
Li, Chenghuan [1]
Xiao, Meihua [1]
Li, Zehuan [1]
Chen, Fangping [2]
Wang, Dingli [1]
Affiliations
[1] East China Jiaotong University, School of Software, Nanchang, Jiangxi, China
[2] Jiangxi University of Software Professional Technology, Nanchang, Jiangxi, China
Funding
National Natural Science Foundation of China
Keywords
3D reconstruction; Computer vision; Deep learning; Transformer; Selective state space model; Voxel model
DOI
10.7717/peerj-cs.2403
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Constructing global context information and local fine-grained information simultaneously is extremely important for single-view 3D reconstruction. In this study, we propose a network that applies attention along both the spatial and channel dimensions for single-view 3D reconstruction, named R3Davit. Specifically, R3Davit consists of an encoder and a decoder, where the encoder is the DaViT backbone network. Unlike previous transformer backbones, DaViT attends over both the spatial and channel dimensions, fully constructing global context information and local fine-grained information while maintaining linear complexity. To learn effectively from the dual-attention features while preserving the network's overall inference speed, we use no self-attention layer in the decoder; instead, we design a decoder built from a nonlinear enhancement block, a selective state space model block, and an up-sampling residual block. The nonlinear enhancement block strengthens the nonlinear learning ability of the network. The selective state space model block takes over the role of the self-attention layer while maintaining linear complexity. The up-sampling residual block converts voxel features into a voxel model while retaining the current layer's voxel features, which are reused in the next layer's up-sampling block. Experiments on the synthetic ShapeNet dataset and on ShapeNetChairRFC, a variant with random backgrounds, show that our method outperforms recent state-of-the-art (SOTA) methods, leading by 1% and 2% in IoU and F1 score, respectively. Experiments on the real-world Pix3d dataset further demonstrate the robustness of our method. The code will be available at https://github.com/epicgzs1112/R3Davit.
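The decoder design described in the abstract can be illustrated with a short sketch. The PyTorch code below is a hypothetical reading of the abstract, not the authors' implementation: the class names (NonlinearEnhancementBlock, SelectiveSSMBlock, UpsampleResidualBlock), all layer sizes, and the block ordering are assumptions, and the selective scan is a slow sequential reference version of a Mamba-style recurrence rather than an optimized kernel.

# Hypothetical sketch of the R3Davit decoder described in the abstract.
# All class names, layer sizes, and the block ordering are assumptions
# made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonlinearEnhancementBlock(nn.Module):
    """Residual channel MLP with GELU, meant to boost nonlinear capacity."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model), nn.GELU(),
            nn.Linear(expansion * d_model, d_model))

    def forward(self, x):            # x: (batch, tokens, d_model)
        return x + self.net(x)       # residual keeps features stable


class SelectiveSSMBlock(nn.Module):
    """Selective state-space layer standing in for self-attention.

    Linear in sequence length: a hidden state h is updated once per token
    with input-dependent (selective) dynamics, as in Mamba (Gu, 2024)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
        self.proj = nn.Linear(d_model, 2 * d_state + 1)  # per-token B, C, dt
        self.skip = nn.Parameter(torch.ones(d_model))

    def forward(self, x):            # x: (batch, seqlen, d_model)
        bsz, seqlen, d = x.shape
        n = self.A_log.shape[1]
        B, C, dt = torch.split(self.proj(x), [n, n, 1], dim=-1)
        dt = F.softplus(dt)                     # positive step size
        A = -torch.exp(self.A_log)              # stable (negative) dynamics
        h = x.new_zeros(bsz, d, n)              # hidden state
        out = []
        for t in range(seqlen):                 # sequential scan, O(seqlen)
            dA = torch.exp(dt[:, t].unsqueeze(-1) * A)          # (bsz, d, n)
            dB = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)  # (bsz, 1, n)
            h = dA * h + dB * x[:, t].unsqueeze(-1)             # update state
            out.append((h * C[:, t].unsqueeze(1)).sum(-1))      # readout (bsz, d)
        return torch.stack(out, dim=1) + self.skip * x


class UpsampleResidualBlock(nn.Module):
    """Doubles voxel resolution; the up-sampled features also travel a
    skip path, i.e., this layer's voxel features are retained and handed
    to the next up-sampling stage, as the abstract describes."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(c_in, c_out, 4, stride=2, padding=1)
        self.refine = nn.Conv3d(c_out, c_out, 3, padding=1)

    def forward(self, v):            # v: (batch, c_in, D, H, W)
        feats = F.gelu(self.up(v))
        return F.gelu(self.refine(feats)) + feats


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 96)  # e.g., a 4x4x4 grid of 96-dim encoder tokens
    x = SelectiveSSMBlock(96)(NonlinearEnhancementBlock(96)(tokens))
    vox = x.transpose(1, 2).reshape(2, 96, 4, 4, 4)
    for c_in, c_out in [(96, 48), (48, 24), (24, 12)]:
        vox = UpsampleResidualBlock(c_in, c_out)(vox)
    print(vox.shape)                 # -> torch.Size([2, 12, 32, 32, 32])

With these stub sizes, three up-sampling stages take a 4x4x4 feature grid to a 32x32x32 voxel output; the residual path in each stage is what carries one layer's voxel features into the next.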
Pages: 16
References (32 in total; first 10 shown)
[1] Barron J.T., Mildenhall B., Tancik M., Hedman P., Martin-Brualla R., Srinivasan P.P. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 5835-5844.
[2] Choy C.B., Xu D., Gwak J., Chen K., Savarese S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. Computer Vision - ECCV 2016, Part VIII, LNCS 9912, 2016: 628-644.
[3] Deng J., et al. ImageNet: A Large-Scale Hierarchical Image Database. Proc. IEEE CVPR, 2009: 248. DOI: 10.1109/CVPRW.2009.5206848.
[4] Deng K., Liu A., Zhu J.-Y., Ramanan D. Depth-supervised NeRF: Fewer Views and Faster Training for Free. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 12872-12881.
[5] Ding M., Xiao B., Codella N., Luo P., Wang J., Yuan L. DaViT: Dual Attention Vision Transformers. Computer Vision - ECCV 2022, Part XXIV, LNCS 13684, 2022: 74-92.
[6] Dosovitskiy A., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2021. DOI: 10.48550/arXiv.2010.11929.
[7] Fan H., Su H., Guibas L. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 2463-2471.
[8] Gu A. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2024.
[9] Jia X., Yang S., Peng Y., Zhang J., Chen S. DV-Net: Dual-view network for 3D reconstruction by fusing multiple sets of gated control point clouds. Pattern Recognition Letters, 2020, 131: 376-382.
[10] Kar A., et al. Learning a Multi-View Stereo Machine. Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2017: 364.