Learning Depth Representation From RGB-D Videos by Time-Aware Contrastive Pre-Training

Cited by: 2
Authors
He, Zongtao [1 ]
Wang, Liuyi [1 ]
Dang, Ronghao [1 ]
Li, Shu [1 ]
Yan, Qingqing [1 ]
Liu, Chengju [1 ]
Chen, Qijun [1 ]
Affiliations
[1] Tongji University, Robot and Artificial Intelligence Lab (RAIL), Shanghai 201804, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; artificial intelligence; videos; training; databases; visualization; feature extraction; depth representation; pre-training methods; contrastive learning; embodied AI; navigation
DOI
10.1109/TCSVT.2023.3326373
Chinese Library Classification
TM (Electrical Technology); TN (Electronic and Communication Technology)
Discipline Codes
0808; 0809
Abstract
Existing end-to-end depth representations in embodied AI are often task-specific and lack the benefits of the emerging pre-training paradigm due to limited datasets and training techniques for RGB-D videos. To address the challenge of obtaining a robust and generalized depth representation for embodied AI, we introduce a unified RGB-D video dataset (UniRGBD) and a novel time-aware contrastive (TAC) pre-training approach. UniRGBD addresses the scarcity of large-scale depth pre-training datasets by providing a comprehensive collection of data from diverse sources in a unified format, enabling convenient data loading and accommodating various data domains. We also design an RGB-Depth alignment evaluation procedure and introduce a novel Near-K accuracy metric to assess the scene understanding capability of the depth encoder. The TAC pre-training approach then fills the gap in depth pre-training methods suitable for RGB-D videos by leveraging the intrinsic similarity between temporally proximate frames. TAC incorporates a soft-label design that acts as valid label noise, enhancing depth semantic extraction and promoting diverse and generalized knowledge acquisition. Furthermore, the perspective shifts between temporally proximate frames facilitate the extraction of invariant and comprehensive features, improving the robustness of the learned depth representation. Additionally, the inclusion of temporal information stabilizes training gradients and enables spatio-temporal depth perception. A comprehensive evaluation of RGB-Depth alignment demonstrates the superiority of our approach over state-of-the-art methods. We also conduct an uncertainty analysis and a novel zero-shot experiment to validate the robustness and generalization of the TAC approach. Moreover, TAC pre-training yields significant performance improvements across various embodied AI tasks, providing compelling evidence of its efficacy in diverse domains.
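The abstract introduces Near-K accuracy without a formal definition. Below is a minimal sketch of one plausible reading, assuming frame-level RGB-to-Depth retrieval over embeddings of a single video: a query counts as correct when the top-ranked depth frame lies within K frames of its true index. The function name `near_k_accuracy` and the retrieval setup are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def near_k_accuracy(rgb_emb: torch.Tensor, depth_emb: torch.Tensor,
                    frame_idx: torch.Tensor, k: int = 1) -> float:
    """Hypothetical Near-K retrieval accuracy for RGB-Depth alignment.

    rgb_emb, depth_emb: (N, D) embeddings of temporally aligned frames,
                        assumed here to come from a single video.
    frame_idx:          (N,) integer frame indices of those embeddings.
    k:                  temporal tolerance in frames.
    """
    sim = F.normalize(rgb_emb, dim=-1) @ F.normalize(depth_emb, dim=-1).t()
    top1 = sim.argmax(dim=-1)                        # nearest depth frame per RGB query
    hits = (frame_idx[top1] - frame_idx).abs() <= k  # correct if within K frames
    return hits.float().mean().item()
```

With k = 0 this collapses to ordinary top-1 retrieval accuracy, so K directly controls how much temporal slack the metric tolerates.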
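In the same spirit, here is a minimal sketch of a time-aware contrastive objective with temporally soft targets, assuming a CLIP-style symmetric InfoNCE setup in PyTorch. The Gaussian kernel over frame-index distance, the hyperparameters `tau` and `sigma`, and the single-video batch are illustrative assumptions rather than the paper's exact soft-label design.

```python
import torch
import torch.nn.functional as F

def tac_loss(rgb_emb: torch.Tensor, depth_emb: torch.Tensor,
             frame_idx: torch.Tensor, tau: float = 0.07,
             sigma: float = 2.0) -> torch.Tensor:
    """Illustrative time-aware contrastive loss over paired RGB-D frames.

    rgb_emb, depth_emb: (B, D) embeddings of paired frames (one video per
                        batch is assumed; cross-video pairs would need masking).
    frame_idx:          (B,) integer frame indices within the source video.
    """
    rgb_emb = F.normalize(rgb_emb, dim=-1)
    depth_emb = F.normalize(depth_emb, dim=-1)
    logits = rgb_emb @ depth_emb.t() / tau  # (B, B) similarity logits

    # Soft targets: instead of a one-hot identity matrix, spread label mass
    # over temporally proximate frames via a Gaussian kernel on index distance.
    dist = (frame_idx[:, None] - frame_idx[None, :]).float().abs()
    soft_targets = torch.softmax(-dist.pow(2) / (2.0 * sigma ** 2), dim=-1)

    # Symmetric soft cross-entropy (RGB->Depth and Depth->RGB); because `dist`
    # is symmetric, the same target matrix serves both directions.
    loss_rd = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    loss_dr = -(soft_targets * F.log_softmax(logits.t(), dim=-1)).sum(-1).mean()
    return 0.5 * (loss_rd + loss_dr)
```

As sigma approaches zero the soft targets collapse to the standard one-hot InfoNCE targets, so the temporal softening can be annealed during training if desired.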
Pages: 4143-4158 (16 pages)