Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Cited by: 0
Authors
Metcalf, Katherine [1 ]
Sarabia, Miguel [1 ]
Mackraz, Natalie [1 ]
Theobald, Barry-John [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
Source
CONFERENCE ON ROBOT LEARNING, VOL 229 | 2023
Keywords
human-in-the-loop learning; preference-based RL; RLHF
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z^sa via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z^sa, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth-reward policy performance versus only 38% and 21% without the learned dynamics-aware reward function. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
Pages: 49
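
The abstract describes an alternating two-step procedure: a self-supervised temporal consistency task that learns the state-action representation z^sa, and a preference-based reward function bootstrapped from that representation. The sketch below is a minimal PyTorch-style illustration of this loop, not the authors' implementation; the module names, network sizes, negative-cosine consistency objective, and preference-loss details are assumptions made for illustration (see the linked repo for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateActionEncoder(nn.Module):
    """Encodes (s, a) into a dynamics-aware representation z^sa (illustrative)."""

    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # Predicts the next-state embedding from z^sa (temporal consistency head).
        self.dynamics_head = nn.Linear(z_dim, z_dim)
        # Embeds the observed next state as the prediction target.
        self.state_proj = nn.Linear(state_dim, z_dim)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

    def temporal_consistency_loss(self, s, a, s_next):
        """Self-supervised objective: z^sa should predict the next-state embedding."""
        pred = F.normalize(self.dynamics_head(self(s, a)), dim=-1)
        target = F.normalize(self.state_proj(s_next), dim=-1)
        return -(pred * target).sum(dim=-1).mean()  # negative cosine similarity


class PreferenceReward(nn.Module):
    """Reward head bootstrapped from the encoder's dynamics-aware z^sa features."""

    def __init__(self, encoder, z_dim=64):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(z_dim, 1)

    def forward(self, s, a):
        return self.head(self.encoder(s, a)).squeeze(-1)


def preference_loss(reward_model, seg_a, seg_b, prefer_b):
    """Bradley-Terry loss over a pair of behavior segments.

    seg_a, seg_b: (states, actions) tensors of shape (batch, T, dim);
    prefer_b: long tensor of 0/1 labels, where 1 means segment b was preferred.
    """
    r_a = reward_model(*seg_a).sum(dim=-1)  # summed predicted reward over segment a
    r_b = reward_model(*seg_b).sum(dim=-1)
    return F.cross_entropy(torch.stack([r_a, r_b], dim=-1), prefer_b)
```

In this sketch, training would alternate between minimizing temporal_consistency_loss on environment transitions and minimizing preference_loss on labeled segment pairs, with the encoder shared across both objectives so that the reward head is bootstrapped from the dynamics-aware representation rather than learned from preference labels alone.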