Learning Depth Representation From RGB-D Videos by Time-Aware Contrastive Pre-Training

Cited by: 2
Authors
He, Zongtao [1 ]
Wang, Liuyi [1 ]
Dang, Ronghao [1 ]
Li, Shu [1 ]
Yan, Qingqing [1 ]
Liu, Chengju [1 ]
Chen, Qijun [1 ]
Affiliations
[1] Tongji University, Robot and Artificial Intelligence Lab (RAIL), Shanghai 201804, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; artificial intelligence; videos; training; databases; visualization; feature extraction; depth representation; pre-training methods; contrastive learning; embodied AI; navigation
DOI
10.1109/TCSVT.2023.3326373
Chinese Library Classification
TM (Electrical Technology); TN (Electronic and Communication Technology)
Discipline Codes
0808; 0809
Abstract
Existing end-to-end depth representations in embodied AI are often task-specific and lack the benefits of the emerging pre-training paradigm due to limited datasets and training techniques for RGB-D videos. To address the challenge of obtaining a robust and generalized depth representation for embodied AI, we introduce a unified RGB-D video dataset (UniRGBD) and a novel time-aware contrastive (TAC) pre-training approach. UniRGBD addresses the scarcity of large-scale depth pre-training datasets by providing a comprehensive collection of data from diverse sources in a unified format, enabling convenient data loading and accommodating various data domains. We also design an RGB-Depth alignment evaluation procedure and introduce a novel Near-K accuracy metric to assess the scene understanding capability of the depth encoder. The TAC pre-training approach then fills the gap in depth pre-training methods suitable for RGB-D videos by leveraging the intrinsic similarity between temporally proximate frames. TAC incorporates a soft-label design that acts as valid label noise, enhancing depth semantic extraction and promoting diverse and generalized knowledge acquisition. Furthermore, the perspective shifts between temporally proximate frames facilitate the extraction of invariant and comprehensive features, improving the robustness of the learned depth representation. Additionally, the inclusion of temporal information stabilizes training gradients and enables spatio-temporal depth perception. A comprehensive evaluation of RGB-Depth alignment demonstrates the superiority of our approach over state-of-the-art methods. We also conduct an uncertainty analysis and a novel zero-shot experiment to validate the robustness and generalization of the TAC approach. Moreover, TAC pre-training yields significant performance improvements across various embodied AI tasks, providing compelling evidence of its efficacy in diverse domains.
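The abstract introduces Near-K accuracy without a formal definition. Below is a minimal sketch of one plausible reading, assuming frame-level RGB-to-Depth retrieval over embeddings of a single video: a query counts as correct when the top-ranked depth frame lies within K frames of its true index. The function name `near_k_accuracy` and the retrieval setup are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def near_k_accuracy(rgb_emb: torch.Tensor, depth_emb: torch.Tensor,
                    frame_idx: torch.Tensor, k: int = 1) -> float:
    """Hypothetical Near-K retrieval accuracy for RGB-Depth alignment.

    rgb_emb, depth_emb: (N, D) embeddings of temporally aligned frames,
                        assumed here to come from a single video.
    frame_idx:          (N,) integer frame indices of those embeddings.
    k:                  temporal tolerance in frames.
    """
    sim = F.normalize(rgb_emb, dim=-1) @ F.normalize(depth_emb, dim=-1).t()
    top1 = sim.argmax(dim=-1)                        # nearest depth frame per RGB query
    hits = (frame_idx[top1] - frame_idx).abs() <= k  # correct if within K frames
    return hits.float().mean().item()
```

With k = 0 this collapses to ordinary top-1 retrieval accuracy, so K directly controls how much temporal slack the metric tolerates.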
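In the same spirit, here is a minimal sketch of a time-aware contrastive objective with temporally soft targets, assuming a CLIP-style symmetric InfoNCE setup in PyTorch. The Gaussian kernel over frame-index distance, the hyperparameters `tau` and `sigma`, and the single-video batch are illustrative assumptions rather than the paper's exact soft-label design.

```python
import torch
import torch.nn.functional as F

def tac_loss(rgb_emb: torch.Tensor, depth_emb: torch.Tensor,
             frame_idx: torch.Tensor, tau: float = 0.07,
             sigma: float = 2.0) -> torch.Tensor:
    """Illustrative time-aware contrastive loss over paired RGB-D frames.

    rgb_emb, depth_emb: (B, D) embeddings of paired frames (one video per
                        batch is assumed; cross-video pairs would need masking).
    frame_idx:          (B,) integer frame indices within the source video.
    """
    rgb_emb = F.normalize(rgb_emb, dim=-1)
    depth_emb = F.normalize(depth_emb, dim=-1)
    logits = rgb_emb @ depth_emb.t() / tau  # (B, B) similarity logits

    # Soft targets: instead of a one-hot identity matrix, spread label mass
    # over temporally proximate frames via a Gaussian kernel on index distance.
    dist = (frame_idx[:, None] - frame_idx[None, :]).float().abs()
    soft_targets = torch.softmax(-dist.pow(2) / (2.0 * sigma ** 2), dim=-1)

    # Symmetric soft cross-entropy (RGB->Depth and Depth->RGB); because `dist`
    # is symmetric, the same target matrix serves both directions.
    loss_rd = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    loss_dr = -(soft_targets * F.log_softmax(logits.t(), dim=-1)).sum(-1).mean()
    return 0.5 * (loss_rd + loss_dr)
```

As sigma approaches zero the soft targets collapse to the standard one-hot InfoNCE targets, so the temporal softening can be annealed during training if desired.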
Pages: 4143-4158 (16 pages)