Weak Human Preference Supervision for Deep Reinforcement Learning

Cited by: 27
Authors
Cao, Zehong [1 ]
Wong, KaiChiu [2 ,3 ]
Lin, Chin-Teng [4 ,5 ]
Affiliations
[1] Univ South Australia, STEM, Mawson Lakes Campus, Adelaide, SA 5095, Australia
[2] Univ Tasmania, Sch Informat & Commun Technol ICT, Hobart, Tas 7005, Australia
[3] MyState Bank, Hobart, Tas 7000, Australia
[4] Univ Technol Sydney, Australian Artificial Intelligence Inst AAII, Ultimo, NSW 2007, Australia
[5] Univ Technol Sydney, Sch Comp Sci, Ultimo, NSW 2007, Australia
Funding
Australian Research Council;
Keywords
Training; Trajectory; Task analysis; Robots; Supervised learning; Australia; Reinforcement learning; Deep reinforcement learning (RL); scaling; supervised learning; weak human preferences;
DOI
10.1109/TNNLS.2021.3084198
Chinese Library Classification (CLC) code
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reward learning from human preferences can solve complex reinforcement learning (RL) tasks without access to a reward function by eliciting a single fixed preference between pairs of trajectory segments. However, such preference judgments are not dynamic and still require human input over thousands of iterations. In this study, we propose a weak human preference supervision framework: we develop a human preference scaling model that naturally reflects the human perception of the degree of weak choices between trajectories, and we establish a human-demonstration estimator, trained via supervised learning, that generates predicted preferences to reduce the number of human inputs. The proposed framework effectively solves complex RL tasks and achieves higher cumulative rewards on simulated robot locomotion (MuJoCo) benchmarks than single fixed human preferences. Furthermore, the human-demonstration estimator requires human feedback for less than 0.01% of the agent's interactions with the environment and reduces the cost of human input by up to 30% compared with existing approaches. To illustrate the flexibility of our approach, we released a video (https://youtu.be/jQPe1OILT0M) comparing the behaviors of agents trained on different types of human input. We believe that naturally inspired human preferences combined with weakly supervised learning benefit precise reward learning and can be applied to state-of-the-art RL systems, such as human-autonomy teaming systems.
Pages: 5369-5378
Number of pages: 10
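The abstract describes two components: a preference scaling model that turns weak (graded) human choices between trajectory segments into soft labels, and a supervised human-demonstration estimator that predicts those labels to reduce the number of human queries. The sketch below is an assumed, minimal illustration of the first component only: fitting a reward model to scaled preference labels via a Bradley-Terry cross-entropy loss. It is not the authors' released code; the names (RewardModel, preference_loss) and all hyperparameters are illustrative.

```python
# Minimal sketch (assumption, not the authors' released code) of reward learning from
# scaled "weak" preferences between trajectory segments: a reward model is fit so that
# the Bradley-Terry preference probability matches a soft preference label mu in [0, 1]
# instead of a hard 0/1 choice.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a single (observation, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim) -> per-step rewards (batch, T)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(model, seg_a, seg_b, mu):
    """Cross-entropy between the scaled preference label mu (degree to which segment A
    is preferred over segment B) and the Bradley-Terry probability implied by the
    summed predicted rewards of each segment."""
    r_a = model(*seg_a).sum(dim=-1)   # total predicted reward of segment A
    r_b = model(*seg_b).sum(dim=-1)   # total predicted reward of segment B
    p_a = torch.sigmoid(r_a - r_b)    # P(A preferred over B) under Bradley-Terry
    return -(mu * torch.log(p_a + 1e-8) + (1.0 - mu) * torch.log(1.0 - p_a + 1e-8)).mean()


if __name__ == "__main__":
    # Toy usage: a batch of 8 preference queries over length-20 segments
    # in a 10-D observation space with 3-D actions.
    model = RewardModel(obs_dim=10, act_dim=3)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    seg_a = (torch.randn(8, 20, 10), torch.randn(8, 20, 3))
    seg_b = (torch.randn(8, 20, 10), torch.randn(8, 20, 3))
    mu = torch.full((8,), 0.7)        # weak preference: A is only somewhat preferred
    loss = preference_loss(model, seg_a, seg_b, mu)
    loss.backward()
    opt.step()
    print(f"preference loss: {loss.item():.4f}")
```

In the paper's full pipeline, many of the mu labels would come from the trained human-demonstration estimator rather than from direct human queries, which is how human feedback can be kept below 0.01% of the agent's environment interactions; that estimator, and the RL policy optimization that consumes the learned reward, are omitted from this sketch.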