Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Cited by: 0
Authors
Metcalf, Katherine [1 ]
Sarabia, Miguel [1 ]
Mackraz, Natalie [1 ]
Theobald, Barry-John [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
Source
CONFERENCE ON ROBOT LEARNING, VOL 229 | 2023
Keywords
human-in-the-loop learning; preference-based RL; RLHF
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z^sa via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z^sa, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth-reward policy performance versus only 38% and 21% without the learned dynamics-aware reward function. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
Pages: 49
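
The abstract describes an alternating two-step procedure: a self-supervised temporal consistency task that learns the state-action representation z^sa, and a preference-based reward function bootstrapped from that representation. The sketch below is a minimal PyTorch-style illustration of this loop, not the authors' implementation; the module names, network sizes, negative-cosine consistency objective, and preference-loss details are assumptions made for illustration (see the linked repo for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateActionEncoder(nn.Module):
    """Encodes (s, a) into a dynamics-aware representation z^sa (illustrative)."""

    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # Predicts the next-state embedding from z^sa (temporal consistency head).
        self.dynamics_head = nn.Linear(z_dim, z_dim)
        # Embeds the observed next state as the prediction target.
        self.state_proj = nn.Linear(state_dim, z_dim)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

    def temporal_consistency_loss(self, s, a, s_next):
        """Self-supervised objective: z^sa should predict the next-state embedding."""
        pred = F.normalize(self.dynamics_head(self(s, a)), dim=-1)
        target = F.normalize(self.state_proj(s_next), dim=-1)
        return -(pred * target).sum(dim=-1).mean()  # negative cosine similarity


class PreferenceReward(nn.Module):
    """Reward head bootstrapped from the encoder's dynamics-aware z^sa features."""

    def __init__(self, encoder, z_dim=64):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(z_dim, 1)

    def forward(self, s, a):
        return self.head(self.encoder(s, a)).squeeze(-1)


def preference_loss(reward_model, seg_a, seg_b, prefer_b):
    """Bradley-Terry loss over a pair of behavior segments.

    seg_a, seg_b: (states, actions) tensors of shape (batch, T, dim);
    prefer_b: long tensor of 0/1 labels, where 1 means segment b was preferred.
    """
    r_a = reward_model(*seg_a).sum(dim=-1)  # summed predicted reward over segment a
    r_b = reward_model(*seg_b).sum(dim=-1)
    return F.cross_entropy(torch.stack([r_a, r_b], dim=-1), prefer_b)
```

In this sketch, training would alternate between minimizing temporal_consistency_loss on environment transitions and minimizing preference_loss on labeled segment pairs, with the encoder shared across both objectives so that the reward head is bootstrapped from the dynamics-aware representation rather than learned from preference labels alone.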