Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Cited by: 2
Authors
Song, Yaoxian [1 ]
Sun, Penglei [1 ]
Liu, Haoyu [2 ]
Li, Zhixu [1 ]
Song, Wei [2 ]
Xiao, Yanghua [1 ]
Zhou, Xiaofang [3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai 200437, Peoples R China
[2] Zhejiang Lab, Res Ctr Intelligent Robot, Hangzhou 311121, Zhejiang, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Knowledge graphs; Artificial intelligence; Knowledge based systems; Robots; Knowledge engineering; Visualization; Multimodal knowledge graph; scene driven; embodied AI; robotic intelligence; LARGE-SCALE; LANGUAGE;
DOI
10.1109/TKDE.2024.3399746
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Embodied AI is one of the most active research areas in artificial intelligence and robotics; it can effectively improve the intelligence of real-world agents (i.e., robots) that serve human beings. Scene knowledge is important for an agent to understand its surroundings and make correct decisions in the varied open world. Currently, knowledge bases for embodied tasks are lacking, and most existing work uses general knowledge bases or pre-trained models to enhance an agent's intelligence. Conventional knowledge bases are sparse, insufficient in capacity, and costly to collect; pre-trained models suffer from knowledge uncertainty and are hard to maintain. To overcome these challenges in scene knowledge, we propose a scene-driven multimodal knowledge graph (Scene-MMKG) construction method that combines conventional knowledge engineering with large language models. A unified scene knowledge injection framework is introduced for knowledge representation. To evaluate the advantages of the proposed method, we instantiate Scene-MMKG for typical indoor robotic functionalities (manipulation and mobility), producing ManipMob-MMKG. A comparison of characteristics shows that our instantiated ManipMob-MMKG is broadly superior in data-collection efficiency and knowledge quality. Experimental results on typical embodied tasks show that knowledge-enhanced methods using ManipMob-MMKG markedly improve performance without complex re-design of model structures.
Pages: 6962-6976
Page count: 15