ViT-MPI: Vision Transformer Multiplane Images for Surgical Single-View View Synthesis

Cited by: 1
Authors
Han, Chenming [1]
Shao, Ruizhi [2]
Wu, Gaochang [1]
Shao, Hang [3]
Liu, Yebin [2]
Affiliations
[1] Northeastern Univ, State Key Lab Synthet Automat Proc Ind, Shenyang, Peoples R China
[2] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[3] Zhejiang Future Technol Inst, Jiaxing, Peoples R China
Source
ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I | 2024 / Vol. 14473
Keywords
View synthesis; Vision transformer; MPI representation; Endoscopic surgery
DOI
10.1007/978-981-99-8850-1_3
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we explore the use of a single imaging device to acquire immersive 3D perception in endoscopic surgery. To solve the heavily ill-posed problem caused by unknown depth and unseen occlusions, we introduce a Vision Transformer (ViT)-based Multiplane Images (MPI) representation, termed ViT-MPI, for continuous novel view synthesis from a single-view input. The MPI representation provides layered depth images that explicitly decode positional relationships between tissues. Instead of using a fully convolutional network as the backbone of our MPI representation, as existing methods do, we exploit the ViT architecture to collect the tokens output by all stages of the transformer and combine them into feature representations at different resolutions. The interactions between tokens in the ViT provide accurate predictions of local and global positional relations, ensuring reliable view synthesis of occluded regions with fine-grained details. Experiments on real endoscopic surgery images captured with the da Vinci Surgical Robot System demonstrate that our proposed approach enables the prediction of multi-view images from a single-view input. Moreover, our method produces reasonable depth maps, further enhancing its practical applicability.
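As a rough illustration of how an MPI yields both an image and a depth map, the sketch below applies standard back-to-front "over" compositing to a stack of fronto-parallel RGBA planes. It is a minimal sketch and not the authors' implementation: the function name `composite_mpi`, the array layout, and the far-to-near plane ordering are assumptions made for this example. It renders only the reference view; synthesizing a novel view would additionally require warping each plane into the target camera before compositing.

```python
import numpy as np


def composite_mpi(colors, alphas, depths):
    """Composite MPI planes back to front (hypothetical helper, for illustration).

    colors: (D, H, W, 3) per-plane RGB values in [0, 1]
    alphas: (D, H, W)    per-plane opacity in [0, 1]
    depths: (D,)         plane depths, assumed ordered from farthest to nearest
    Returns the composited image (H, W, 3) and an alpha-weighted depth map (H, W).
    """
    image = np.zeros(colors.shape[1:], dtype=np.float64)
    depth = np.zeros(alphas.shape[1:], dtype=np.float64)
    for rgb, alpha, d in zip(colors, alphas, depths):
        a = alpha[..., None]
        # "Over" operator: the nearer plane partially covers what lies behind it.
        image = rgb * a + image * (1.0 - a)
        # The same blending weights give an expected depth per pixel.
        depth = d * alpha + depth * (1.0 - alpha)
    return image, depth


# Toy input: an opaque red far plane behind a half-transparent green near plane.
H, W = 2, 2
colors = np.zeros((2, H, W, 3))
colors[0, ..., 0] = 1.0  # far plane: red
colors[1, ..., 1] = 1.0  # near plane: green
alphas = np.stack([np.ones((H, W)), np.full((H, W), 0.5)])
depths = np.array([10.0, 2.0])
image, depth = composite_mpi(colors, alphas, depths)
```

In this toy case each pixel blends the two planes equally, so the depth map falls between the two plane depths. In a learned MPI, the per-plane colors and opacities would be predicted by the network rather than set by hand.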
Pages: 28-40
Number of pages: 13