To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited by: 0
Authors
Su, Ke [1 ]
Zhang, Xingxing [1 ]
Zhang, Siyang [2 ]
Zhu, Jun [1 ,3 ,4 ]
Zhang, Bo [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training
D O I
10.1109/TIP.2024.3459800
CLC Classification Code
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, there has been growing research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. In this setting, a new challenge arises: many unseen objects may appear because of the large number of object categories in 3D scenes, which makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive high-quality human annotations closely tied to the task to be learned, but this is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned via iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% w.r.t. answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
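The masked scene graph modeling idea described above can be sketched in miniature: mask one node's features in a small scene graph, run a few rounds of message passing over the graph, and measure how well the masked node is reconstructed from its neighbors. This is an illustrative toy, not the paper's implementation; all names (`message_passing`, `W_msg`, `W_upd`), the graph, and the dimensions are assumptions for demonstration.

```python
import numpy as np

def message_passing(h, adj, W_msg, W_upd, steps=3):
    """Update node features by aggregating linear messages from neighbors."""
    for _ in range(steps):
        msgs = adj @ (h @ W_msg)       # sum incoming messages over graph edges
        h = np.tanh(h @ W_upd + msgs)  # combine with each node's own state
    return h

rng = np.random.default_rng(0)
n_nodes, dim = 5, 8
feats = rng.normal(size=(n_nodes, dim))

# A small symmetric adjacency matrix standing in for a scene graph.
adj = np.zeros((n_nodes, n_nodes))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]:
    adj[i, j] = adj[j, i] = 1.0

# Self-supervised objective: zero out node 2, reconstruct it from context.
masked = feats.copy()
masked[2] = 0.0
W_msg = rng.normal(scale=0.1, size=(dim, dim))
W_upd = rng.normal(scale=0.1, size=(dim, dim))
out = message_passing(masked, adj, W_msg, W_upd)
recon_loss = float(np.mean((out[2] - feats[2]) ** 2))
```

In the paper's actual setting the weights would be trained to minimize the reconstruction objective, so that task-specific structure in the scene graph rectifies the pre-trained representation; here the random weights only illustrate the data flow.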
Pages
5370 - 5381 (12 pages)