Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Cited by: 2
Authors
Song, Yaoxian [1 ]
Sun, Penglei [1 ]
Liu, Haoyu [2 ]
Li, Zhixu [1 ]
Song, Wei [2 ]
Xiao, Yanghua [1 ]
Zhou, Xiaofang [3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai 200437, Peoples R China
[2] Zhejiang Lab, Res Ctr Intelligent Robot, Hangzhou 311121, Zhejiang, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Knowledge graphs; Artificial intelligence; Knowledge based systems; Robots; Knowledge engineering; Visualization; Multimodal knowledge graph; scene driven; embodied AI; robotic intelligence; LARGE-SCALE; LANGUAGE;
DOI
10.1109/TKDE.2024.3399746
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Embodied AI is one of the most active research areas in artificial intelligence and robotics; it can effectively improve the intelligence of real-world agents (i.e., robots) that serve human beings. Scene knowledge is important for an agent to understand its surroundings and make correct decisions in the varied open world. Currently, knowledge bases for embodied tasks are lacking, and most existing work uses general knowledge bases or pre-trained models to enhance an agent's intelligence. Conventional knowledge bases are sparse, insufficient in capacity, and costly to collect; pre-trained models suffer from knowledge uncertainty and are hard to maintain. To overcome these challenges in scene knowledge, we propose a scene-driven multimodal knowledge graph (Scene-MMKG) construction method that combines conventional knowledge engineering with large language models. A unified scene knowledge injection framework is introduced for knowledge representation. To evaluate the advantages of the proposed method, we instantiate Scene-MMKG for typical indoor robotic functionalities (manipulation and mobility), producing ManipMob-MMKG. A comparison of characteristics shows that our instantiated ManipMob-MMKG is broadly superior in data-collection efficiency and knowledge quality. Experimental results on typical embodied tasks show that knowledge-enhanced methods using ManipMob-MMKG markedly improve performance without complex re-design of model structures.
Pages: 6962-6976
Page count: 15