To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited by: 0
Authors
Su, Ke [1 ]
Zhang, Xingxing [1 ]
Zhang, Siyang [2 ]
Zhu, Jun [1 ,3 ,4 ]
Zhang, Bo [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training;
DOI
10.1109/TIP.2024.3459800
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge in this setting is that many unseen objects may appear, owing to the increased number of object categories in 3D scenes; this makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive, high-quality human annotations closely tied to the task being learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned via iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes containing a large number of new objects at test time), achieving new state-of-the-art performance.
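The record does not give the paper's actual architecture, but the masked scene graph modeling idea described in the abstract (mask some node features, propagate information between connected objects, and score reconstruction of the masked nodes) can be sketched minimally. All names, the mean-aggregation update, and the MSE objective below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def message_passing(feats, adj, rounds=2):
    """Iteratively mix each node's feature with its neighbors' (mean update)."""
    deg = adj.sum(axis=1, keepdims=True) + 1.0  # +1 accounts for the self-loop
    out = feats
    for _ in range(rounds):
        out = (out + adj @ out) / deg
    return out

def msgm_loss(feats, adj, mask_idx, mask_token):
    """Mask selected node features, propagate, and score reconstruction (MSE)."""
    corrupted = feats.copy()
    corrupted[mask_idx] = mask_token
    recon = message_passing(corrupted, adj)
    return float(np.mean((recon[mask_idx] - feats[mask_idx]) ** 2))

# Toy scene graph: 4 objects connected in a chain (0-1-2-3).
adj = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

feats = rng.normal(size=(4, 8))   # stand-in for pre-trained object embeddings
mask_token = np.zeros(8)          # a learnable vector in a real model
loss = msgm_loss(feats, adj, mask_idx=[1], mask_token=mask_token)
print(loss)
```

In a trainable version, `message_passing` would be a parameterized graph network and minimizing this loss over many masked graphs is what would inject task-specific relational knowledge into the representation.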
Pages: 5370-5381
Page count: 12