Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Cited by: 0
Authors
Yang, Yanting [1 ]
Chen, Minghao [2 ]
Qiu, Qibo [3 ,7 ]
Wu, Jiahao [4 ]
Wang, Wenxiao [1 ]
Lin, Binbin [1 ,5 ]
Guan, Ziyu [6 ]
He, Xiaofei [7 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, Hangzhou, Peoples R China
[2] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[3] China Mobile Zhejiang Res & Innovat Inst, Hangzhou, Peoples R China
[4] Hong Kong Polytech Univ, Hung Hom, Hong Kong, Peoples R China
[5] Zhiyuan Res Inst, Beijing, Peoples R China
[6] Xidian Univ, Sch Comp Sci & Technol, Xian, Peoples R China
[7] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou, Peoples R China
Source
COMPUTER VISION-ECCV 2024, PT LVII | 2025, Vol. 15115
DOI
10.1007/978-3-031-72998-0_10
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For a general-purpose robot to operate in the real world, it must execute a broad range of instructions across diverse environments. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent vision-language models such as CLIP show strong generalization, paving the way for open-domain visual recognition. However, collecting data of robots executing varied language instructions across multiple environments remains a challenge. This paper transfers video-language models with robust generalization into a generalizable language-conditioned reward function, using only robot video data from a small number of tasks in a single environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish successful from failed robot executions, we cluster failure video features so the model can identify patterns within them. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. The resulting language-conditioned reward function generalizes well to new environments and new instructions for robot planning and reinforcement learning.
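To make the clustering-plus-failure-prompt idea in the abstract concrete, the sketch below is a minimal illustration, not the authors' method: it assumes frozen CLIP-style video and text encoders sharing a 512-dim embedding space, uses k-means as the (unspecified) clustering step, and, for brevity, learns one failure prompt directly in the shared embedding space, whereas the paper inserts learned prompt tokens into the text encoder. All names, dimensions, the temperature, and the reward scoring rule are assumptions.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

EMBED_DIM = 512   # shared video/text embedding size (assumed)
NUM_CLUSTERS = 8  # number of failure modes (assumed hyperparameter)

# 1) Cluster failure-video features to expose recurring failure modes.
#    failure_feats stands in for frozen video-encoder outputs.
failure_feats = torch.randn(1000, EMBED_DIM)
kmeans = KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit(failure_feats.numpy())
cluster_labels = torch.from_numpy(kmeans.labels_).long()

# 2) One learnable failure prompt per cluster. Here the prompts live
#    directly in the shared embedding space (a simplification); the
#    paper instead integrates prompt tokens into the text encoder.
failure_prompts = torch.nn.Parameter(0.02 * torch.randn(NUM_CLUSTERS, EMBED_DIM))

def prompt_loss(video_feats, instr_feat, targets, tau=0.07):
    # Cross-entropy over [instruction, K failure prompts]:
    # target 0 for success videos, 1 + cluster id for failure videos.
    v = F.normalize(video_feats, dim=-1)                       # (B, D)
    cands = torch.cat([F.normalize(instr_feat, dim=0).unsqueeze(0),
                       F.normalize(failure_prompts, dim=-1)])  # (1+K, D)
    return F.cross_entropy(v @ cands.T / tau, targets)

def reward(video_feats, instr_feat):
    # Reward: similarity to the instruction minus similarity to the
    # best-matching failure mode (one plausible scoring rule).
    v = F.normalize(video_feats, dim=-1)
    success = v @ F.normalize(instr_feat, dim=0)               # (B,)
    failure = (v @ F.normalize(failure_prompts, dim=-1).T).max(dim=-1).values
    return success - failure

In this sketch, failure_prompts would be trained with prompt_loss on a mix of success and failure clips while both pretrained encoders stay frozen, after which reward() supplies the scalar signal for planning or reinforcement learning.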
Pages: 163-180
Page count: 18