Reinforcement Learning with a Corrupted Reward Channel

Cited by: 0

Authors:

Everitt, Tom [1]

Krakovna, Victoria [2]

Orseau, Laurent [2]

Legg, Shane [2]

Affiliations:

[1] Australian Natl Univ, Canberra, ACT, Australia

[2] DeepMind, London, England

Source:

PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2017

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

No real-world reward function is perfect. Sensory errors and software bugs may result in agents getting higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP (CRMDP). Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.


Pages: 4705-4713

Page count: 9

