Inverse Preference Learning: Preference-based RL without a Reward Function

Cited by: 0
Authors
Hejna, Joey [1 ]
Sadigh, Dorsa [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods naively combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the Q-function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released.
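The abstract's key insight can be made concrete with the inverse Bellman operator: for a fixed policy, the reward implied by a Q-function is r(s, a) = Q(s, a) - γ·E[V(s')], so this implied reward can be substituted directly into the standard Bradley-Terry preference loss, removing the separate reward network. Below is a minimal sketch under that reading; the names (q_net, v_net, the segment layout) and the discrete-action setup are illustrative assumptions, not the authors' released implementation, which also handles continuous control with soft/conservative value estimates.

```python
# Minimal sketch of IPL's core idea (hypothetical names, discrete actions):
# replace the learned reward with the reward implied by the Q-function via
# the inverse Bellman operator, then train Q on preference data directly.
import torch
import torch.nn.functional as F

GAMMA = 0.99

def implied_reward(q_net, v_net, obs, act, next_obs):
    """Reward implied by Q: r(s, a) = Q(s, a) - gamma * V(s')."""
    q = q_net(obs).gather(-1, act.unsqueeze(-1)).squeeze(-1)  # Q(s, a), [B, T]
    v_next = v_net(next_obs).squeeze(-1)                      # V(s'),   [B, T]
    return q - GAMMA * v_next

def preference_loss(q_net, v_net, seg_a, seg_b, labels):
    """Bradley-Terry loss over segment pairs using implied rewards.

    seg_a, seg_b: (obs, act, next_obs) tensors of shape [batch, T, ...].
    labels: 1.0 where segment A was preferred by the human, else 0.0.
    """
    ret_a = implied_reward(q_net, v_net, *seg_a).sum(dim=-1)  # return of A
    ret_b = implied_reward(q_net, v_net, *seg_b).sum(dim=-1)  # return of B
    # Bradley-Terry model: P(A > B) = sigmoid(return_A - return_B)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, labels.float())
```

Because the loss is written entirely in terms of Q (and the value derived from it), gradient steps on preference data shape the Q-function itself, which is the sense in which the reward and Q-function are interchangeable for a fixed policy.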
Pages: 22