Spike Transformer: Monocular Depth Estimation for Spiking Camera

Cited by: 17
Authors
Zhang, Jiyuan [1 ]
Tang, Lulu [2 ,3 ]
Yu, Zhaofei [1 ]
Lu, Jiwen [3 ]
Huang, Tiejun [1 ,2 ]
Affiliations
[1] Peking Univ, Dept Comp Sci, Beijing, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
Source
COMPUTER VISION, ECCV 2022, PT VII | 2022 / Vol. 13667
Funding
National Natural Science Foundation of China;
Keywords
Depth estimation; Transformer; Spiking camera; Spike data;
DOI
10.1007/978-3-031-20071-7_3
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The spiking camera is a bio-inspired vision sensor that mimics the sampling mechanism of the primate fovea and has shown great potential for capturing high-speed dynamic scenes at a sampling rate of 40,000 Hz. Unlike conventional digital cameras, the spiking camera continuously captures photons and outputs asynchronous binary spikes that encode time, location, and light intensity. Because of this different sampling mechanism, off-the-shelf image-based algorithms designed for digital cameras are unsuitable for the spike streams a spiking camera generates, so it is of particular interest to develop novel, spike-aware algorithms for common computer vision tasks. In this paper, we focus on depth estimation, a task that has not been explored for the spiking camera and is challenging due to the natural properties of spike streams, such as irregularity, continuity, and spatio-temporal correlation. We present Spike Transformer (Spike-T), a novel paradigm for learning from spike data and estimating monocular depth from continuous spike streams. To fit spike data to the Transformer, we present an input spike embedding equipped with a spatio-temporal patch partition module that maintains features from both the spatial and temporal domains. Furthermore, we build two spike-based depth datasets: one synthetic and one captured by a real spiking camera. Experimental results demonstrate that the proposed Spike-T favorably predicts scene depth and consistently outperforms its direct competitors. More importantly, the representation learned by Spike-T transfers well to unseen real data, indicating that Spike-T generalizes to real-world scenarios. To the best of our knowledge, this is the first work to make depth estimation directly from spike streams possible. Code and datasets are available at https://github.com/Leozhangjiyuan/MDE-SpikingCamera.
Pages: 34-52
Page count: 19
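
The core technical idea in the abstract, partitioning a binary spike stream into spatio-temporal patches before feeding it to a Transformer, can be illustrated in a few lines of PyTorch. The sketch below is an assumption-based illustration, not the authors' implementation: the module name SpikePatchEmbedding, the tube size (16, 4, 4), and the embedding width of 256 are hypothetical choices made for clarity.

import torch
import torch.nn as nn

class SpikePatchEmbedding(nn.Module):
    """Partition a binary spike stream into spatio-temporal tubes and embed them.

    Hypothetical sketch: the tube size (t, p, p) and embed_dim are illustrative
    choices, not the configuration reported in the paper.
    """
    def __init__(self, t=16, p=4, embed_dim=256):
        super().__init__()
        # A Conv3d whose kernel equals its stride performs the non-overlapping
        # partition and the linear projection of each tube in a single operation.
        self.proj = nn.Conv3d(1, embed_dim, kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, spikes):
        # spikes: (B, T, H, W) binary stream from the spiking camera
        x = spikes.unsqueeze(1).float()      # (B, 1, T, H, W)
        x = self.proj(x)                     # (B, C, T//t, H//p, W//p)
        return x.flatten(2).transpose(1, 2)  # (B, N, C): one token per tube

# 128 time steps of 64x64 spikes become a token sequence a Transformer can consume.
spikes = torch.randint(0, 2, (1, 128, 64, 64))
tokens = SpikePatchEmbedding()(spikes)
print(tokens.shape)  # torch.Size([1, 2048, 256])

Fusing the partition and projection into a single strided Conv3d mirrors the tubelet embedding used by video Transformers such as ViViT; each resulting token carries both spatial and temporal context, matching the abstract's stated goal of maintaining features from both domains.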