Transformers in Unsupervised Structure-from-Motion

Times Cited: 0
Authors
Chawla, Hemang [1 ,2 ]
Varma, Arnav [1 ]
Arani, Elahe [1 ,2 ]
Zonooz, Bahram [1 ,2 ]
Affiliations
[1] NavInfo Europe, Adv Res Lab, Eindhoven, Netherlands
[2] Eindhoven Univ Technol, Dept Math & Comp Sci, Eindhoven, Netherlands
Source
COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VISIGRAPP 2022 | 2023 / Vol. 1815
Keywords
Structure-from-motion; Monocular depth estimation; Monocular pose estimation; Camera calibration; Natural corruptions; Adversarial attacks; VISION;
DOI
10.1007/978-3-031-45725-8_14
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers have revolutionized deep learning-based computer vision, improving both performance and robustness to natural corruptions and adversarial attacks. They are used predominantly for 2D vision tasks such as image classification, semantic segmentation, and object detection. However, robots and advanced driver assistance systems also require 3D scene understanding, extracted via structure-from-motion (SfM), for decision making. We propose a robust transformer-based monocular SfM method that simultaneously learns to predict pixel-wise depth, the ego vehicle's translation and rotation, and the camera's focal length and principal point. Through experiments on the KITTI and DDAD datasets, we demonstrate how to adapt different vision transformers and compare them against contemporary CNN-based methods. Our study shows that the transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust against natural corruptions as well as untargeted and targeted attacks. (Code: https://github.com/NeurAI-Lab/MT-SfMLearner).
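The abstract describes a shared transformer backbone feeding separate heads for dense depth, 6-DoF ego-motion, and camera intrinsics. The following minimal PyTorch-style sketch illustrates that layout only; all class and parameter names (PatchEncoder, MonocularSfM, depth_head, pose_head, intr_head, the 192x192 input size, etc.) are illustrative assumptions and are not taken from the paper or its released code at https://github.com/NeurAI-Lab/MT-SfMLearner.

```python
# Hypothetical sketch of a transformer-based unsupervised SfM network:
# one backbone, three heads (depth, ego-motion, camera intrinsics).
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    """ViT-style patch embedding followed by a transformer encoder."""

    def __init__(self, img_size=192, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.encoder(tokens + self.pos)


class MonocularSfM(nn.Module):
    """Predicts per-pixel depth, 6-DoF ego-motion, and camera intrinsics."""

    def __init__(self, img_size=192, patch=16, dim=256):
        super().__init__()
        self.grid = img_size // patch
        self.backbone = PatchEncoder(img_size, patch, dim)
        # Depth head: token grid -> dense inverse-depth map.
        self.depth_head = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Pose head: pooled features of a frame pair -> rotation + translation.
        self.pose_head = nn.Linear(2 * dim, 6)
        # Intrinsics head: pooled features -> focal lengths and principal point.
        self.intr_head = nn.Linear(2 * dim, 4)

    def forward(self, frame_t, frame_s):
        f_t = self.backbone(frame_t)                         # (B, N, dim)
        f_s = self.backbone(frame_s)
        grid_t = f_t.transpose(1, 2).reshape(f_t.size(0), -1, self.grid, self.grid)
        inv_depth = self.depth_head(grid_t)                  # (B, 1, H, W)
        pooled = torch.cat([f_t.mean(1), f_s.mean(1)], dim=-1)
        pose = self.pose_head(pooled)                        # axis-angle + translation
        intrinsics = self.intr_head(pooled)                  # fx, fy, cx, cy (normalized)
        return inv_depth, pose, intrinsics


# Usage: two consecutive frames from a monocular video.
net = MonocularSfM()
depth, pose, K = net(torch.randn(1, 3, 192, 192), torch.randn(1, 3, 192, 192))
```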
Pages: 281 - 303
Number of pages: 23