To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited by: 0
Authors
Su, Ke [1 ]
Zhang, Xingxing [1 ]
Zhang, Siyang [2 ]
Zhu, Jun [1 ,3 ,4 ]
Zhang, Bo [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training;
DOI
10.1109/TIP.2024.3459800
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge in this setting is that many unseen objects may appear, owing to the increased number of object categories in 3D scenes; this makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive, high-quality human annotations closely tied to the task being learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned via iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes containing a large number of new objects at test time), achieving new state-of-the-art performance.
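The record does not give the paper's actual architecture, but the masked scene graph modeling idea described in the abstract (mask some node features, propagate information between connected objects, and score reconstruction of the masked nodes) can be sketched minimally. All names, the mean-aggregation update, and the MSE objective below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def message_passing(feats, adj, rounds=2):
    """Iteratively mix each node's feature with its neighbors' (mean update)."""
    deg = adj.sum(axis=1, keepdims=True) + 1.0  # +1 accounts for the self-loop
    out = feats
    for _ in range(rounds):
        out = (out + adj @ out) / deg
    return out

def msgm_loss(feats, adj, mask_idx, mask_token):
    """Mask selected node features, propagate, and score reconstruction (MSE)."""
    corrupted = feats.copy()
    corrupted[mask_idx] = mask_token
    recon = message_passing(corrupted, adj)
    return float(np.mean((recon[mask_idx] - feats[mask_idx]) ** 2))

# Toy scene graph: 4 objects connected in a chain (0-1-2-3).
adj = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

feats = rng.normal(size=(4, 8))   # stand-in for pre-trained object embeddings
mask_token = np.zeros(8)          # a learnable vector in a real model
loss = msgm_loss(feats, adj, mask_idx=[1], mask_token=mask_token)
print(loss)
```

In a trainable version, `message_passing` would be a parameterized graph network and minimizing this loss over many masked graphs is what would inject task-specific relational knowledge into the representation.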
Pages: 5370-5381
Page count: 12