Clio: Real-Time Task-Driven Open-Set 3D Scene Graphs

Cited: 3
Authors
Maggio, Dominic [1 ]
Chang, Yun [1 ]
Hughes, Nathan [1 ]
Trang, Matthew [2 ]
Griffith, Dan [2 ]
Dougherty, Carlyn [2 ]
Cristofalo, Eric [2 ]
Schmid, Lukas [1 ]
Carlone, Luca [1 ]
Affiliations
[1] MIT, Lab Informat & Decis Syst, Cambridge, MA 02139 USA
[2] Lincoln Lab, MIT, Lexington, MA 02421 USA
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024, Vol. 9, No. 10
Funding
Swiss National Science Foundation; Academy of Finland;
Keywords
Task analysis; Three-dimensional displays; Semantics; Robots; Real-time systems; Image segmentation; Natural languages; Mapping; deep learning for visual perception; semantic scene understanding; ABSTRACTIONS; PERCEPTION; AGENTS;
DOI
10.1109/LRA.2024.3451395
CLC Classification Number
TP24 [Robotics];
Discipline Codes
080202 ; 1405 ;
Abstract
Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us with a fundamental question: what is the right granularity for the objects (and, more generally, for the semantic concepts) the robot has to include in its map representation? While related work implicitly chooses a level of granularity by tuning thresholds for object detection, we argue that such a choice is intrinsically task-dependent. The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language, and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks. We show that this problem can be naturally formulated using the Information Bottleneck (IB), an established information-theoretic framework to discuss task-relevance. The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach, that is able to cluster 3D primitives in the environment into task-relevant objects and regions. The third contribution is to integrate our task-driven clustering algorithm into a real-time pipeline, named Clio, that constructs a hierarchical 3D scene graph of the environment online and using only onboard compute. Our final contribution is an extensive experimental campaign showing that Clio not only allows real-time construction of compact open-set 3D scene graphs, but also improves the accuracy of task execution by limiting the map to relevant semantic concepts.
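The abstract's second contribution rests on the Agglomerative Information Bottleneck: greedily merge the pair of clusters whose union loses the least task-relevant information I(C;Y). The paper's actual formulation and implementation are in the full text; as a rough, generic illustration (not Clio's code — the function names, the two-task toy distributions, and the stopping criterion below are all illustrative assumptions), the classic AIB merge rule scores each candidate pair by its weighted Jensen-Shannon divergence and can be sketched as:

```python
import numpy as np

def js_divergence(p, q, w1, w2):
    """Weighted Jensen-Shannon divergence between distributions p and q."""
    m = w1 * p + w2 * q
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return w1 * kl(p, m) + w2 * kl(q, m)

def agglomerative_ib(px, py_given_x, num_clusters):
    """Greedy agglomerative IB: px[i] is the mass of primitive i,
    py_given_x[i] its distribution over task relevance y.
    Repeatedly merges the pair with minimal information loss."""
    clusters = [(px[i], py_given_x[i].copy(), [i]) for i in range(len(px))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pi, pyi, _ = clusters[i]
                pj, pyj, _ = clusters[j]
                w = pi + pj
                # AIB merge cost: loss in I(C;Y) from merging i and j
                cost = w * js_divergence(pyi, pyj, pi / w, pj / w)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        pi, pyi, mi = clusters[i]
        pj, pyj, mj = clusters[j]
        w = pi + pj
        merged = (w, (pi * pyi + pj * pyj) / w, mi + mj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

For instance, four equally weighted primitives whose relevance distributions split cleanly across two tasks collapse into two clusters, grouping the primitives that serve the same task; Clio additionally exploits scene structure and runs in real time, which this toy sketch ignores.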
Pages: 8921-8928
Page count: 8
Related Papers
63 items in total
  • [1] 3D Scene Graph: A structure for unified semantics, 3D space, and camera
    Armeni, Iro
    He, Zhi-Yang
    Gwak, JunYoung
    Zamir, Amir R.
    Fischer, Martin
    Malik, Jitendra
    Savarese, Silvio
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5663 - 5672
  • [2] Neural Implicit Vision-Language Feature Fields
    Blomqvist, Kenneth
    Milano, Francesco
    Chung, Jen Jen
    Ott, Lionel
    Siegwart, Roland
    [J]. 2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1313 - 1318
  • [3] Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age
    Cadena, Cesar
    Carlone, Luca
    Carrillo, Henry
    Latif, Yasir
    Scaramuzza, Davide
    Neira, Jose
    Reid, Ian
    Leonard, John J.
    [J]. IEEE TRANSACTIONS ON ROBOTICS, 2016, 32 (06) : 1309 - 1332
  • [4] Chang H., 2023, P MACHINE LEARNING R, P1950
  • [5] Chang M, 2023, Arxiv, DOI arXiv:2311.06430
  • [6] Masked-attention Mask Transformer for Universal Image Segmentation
    Cheng, Bowen
    Misra, Ishan
    Schwing, Alexander G.
    Kirillov, Alexander
    Girdhar, Rohit
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 1280 - 1289
  • [7] Ding J., 2023, INT C MACHINE LEARNI, P8090
  • [8] Eftekhar A., 2024, PROC INT C LEARN REP
  • [9] Firoozi R, 2023, Arxiv, DOI arXiv:2312.07843
  • [10] Garg S., 2023, PROC 2 WORKSHOP LANG