Weak Human Preference Supervision for Deep Reinforcement Learning

Cited by: 27
Authors
Cao, Zehong [1 ]
Wong, KaiChiu [2 ,3 ]
Lin, Chin-Teng [4 ,5 ]
Affiliations
[1] Univ South Australia, STEM, Mawson Lakes Campus, Adelaide, SA 5095, Australia
[2] Univ Tasmania, Sch Informat & Commun Technol ICT, Hobart, Tas 7005, Australia
[3] MyState Bank, Hobart, Tas 7000, Australia
[4] Univ Technol Sydney, Australian Artificial Intelligence Inst AAII, Ultimo, NSW 2007, Australia
[5] Univ Technol Sydney, Sch Comp Sci, Ultimo, NSW 2007, Australia
Funding
Australian Research Council
Keywords
Training; Trajectory; Task analysis; Robots; Supervised learning; Australia; Reinforcement learning; Deep reinforcement learning (RL); scaling; supervised learning; weak human preferences;
DOI
10.1109/TNNLS.2021.3084198
CLC number
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Current reward learning from human preferences can solve complex reinforcement learning (RL) tasks without access to a reward function by eliciting a single fixed preference between pairs of trajectory segments. However, such preference judgments are static and still require human input over thousands of iterations. In this study, we propose a weak human preference supervision framework in which a human preference scaling model naturally reflects the perceived degree of weak preference between trajectories, and a human-demonstration estimator, trained via supervised learning, generates predicted preferences to reduce the number of required human inputs. The proposed framework effectively solves complex RL tasks and achieves higher cumulative rewards on simulated robot locomotion tasks (MuJoCo) than single fixed human preferences. Furthermore, the human-demonstration estimator requires human feedback for less than 0.01% of the agent's interactions with the environment and reduces the cost of human input by up to 30% compared with existing approaches. To demonstrate the flexibility of our approach, we released a video (https://youtu.be/jQPe1OILT0M) comparing the behaviors of agents trained on different types of human input. We believe that naturally inspired human preferences combined with weakly supervised learning enable more precise reward learning and can be applied to state-of-the-art RL systems, such as human-autonomy teaming systems.
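For intuition, below is a minimal sketch (not the authors' released code) of how scaled, rather than binary, preference labels can train a reward model: a Bradley-Terry preference probability over two trajectory segments is fit against a soft label mu in [0, 1], where intermediate values express the weak, graded preferences the abstract describes. The RewardModel architecture, tensor shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Small MLP mapping a state-action pair to a scalar reward estimate.
    NOTE: architecture and sizes are illustrative, not the paper's."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def scaled_preference_loss(model, seg_a, seg_b, mu):
    """Cross-entropy between the Bradley-Terry preference probability and a
    scaled human label mu in [0, 1]: mu = 1 means segment A is strictly
    preferred, mu = 0.5 means indifference, and intermediate values encode
    weak (graded) preferences rather than a single fixed choice."""
    obs_a, act_a = seg_a  # each segment: (T, obs_dim) observations, (T, act_dim) actions
    obs_b, act_b = seg_b
    # Sum the predicted per-step rewards over each trajectory segment.
    r_a = model(obs_a, act_a).sum()
    r_b = model(obs_b, act_b).sum()
    # P(A preferred to B) under the Bradley-Terry model.
    p_a = torch.sigmoid(r_a - r_b)
    # Soft-label binary cross-entropy admits graded preference targets.
    return -(mu * torch.log(p_a + 1e-8) + (1.0 - mu) * torch.log(1.0 - p_a + 1e-8))

# Illustrative usage: two 50-step segments from a MuJoCo-like task.
obs_dim, act_dim, T = 17, 6, 50
model = RewardModel(obs_dim, act_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
seg_b = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
loss = scaled_preference_loss(model, seg_a, seg_b, mu=0.7)  # weak preference for A
opt.zero_grad(); loss.backward(); opt.step()
```

In the same spirit, the paper's human-demonstration estimator can be pictured as a supervised model trained on the human-labeled segment pairs that then supplies mu for unlabeled pairs, so the human is queried for only a small fraction of the agent's interactions.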
Pages: 5369-5378 (10 pages)
Related papers (50 in total)
  • [21] Human-Interactive Subgoal Supervision for Efficient Inverse Reinforcement Learning
    Pan, Xinlei
    Shen, Yilin
PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS (AAMAS '18), 2018: 1380-1387
  • [22] Deep Reinforcement Learning for Quantum State Preparation with Weak Nonlinear Measurements
    Porotti, Riccardo
    Essig, Antoine
    Huard, Benjamin
    Marquardt, Florian
    QUANTUM, 2022, 6
  • [23] Reinforcement Learning Based Cardiac Ultrasound Video Summarization Using Weak Supervision and Proximity Reward
    Coban, Ali
    Guzel Turhan, Ceren
    Sarikaya, Duygu
32ND IEEE SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU 2024, 2024
  • [24] Multi-objective deep reinforcement learning for crowd-aware robot navigation with dynamic human preference
Cheng, Guangran
Wang, Yuanda
Dong, Lu
Cai, Wenzhe
Sun, Changyin
Neural Computing and Applications, 2023, 35(22): 16247-16265
  • [26] Learning to Navigate in Human Environments via Deep Reinforcement Learning
    Gao, Xingyuan
    Sun, Shiying
    Zhao, Xiaoguang
    Tan, Min
NEURAL INFORMATION PROCESSING (ICONIP 2019), PT I, 2019, 11953: 418-429
  • [27] Active and Incremental Learning with Weak Supervision
    Brust, Clemens-Alexander
Käding, Christoph
    Denzler, Joachim
KI - Künstliche Intelligenz, 2020, 34(2): 165-180
  • [28] Weak Supervision for Learning Discourse Structure
    Badene, Sonia
    Thompson, Kate
    Lorre, Jean-Pierre
    Asher, Nicholas
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019: 2296-2305
  • [29] Policy Learning Using Weak Supervision
    Wang, Jingkang
    Guo, Hongyi
    Zhu, Zhaowei
    Liu, Yang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34