Shaping reward learning approach from passive samples

Cited by: 1
Authors
Qian, Yu [1 ]
Yu, Yang [1 ]
Zhou, Zhi-Hua [1 ]
Affiliation
[1] National Key Laboratory for Novel Software Technology
Source
Ruan Jian Xue Bao/Journal of Software | 2013 / Vol. 24 / No. 11
Keywords
Passive sample; Policy-invariance; Reinforcement learning; Shaping reward;
DOI
10.3724/SP.J.1001.2013.04471
Abstract
Reinforcement learning (RL) deals with long-term reward maximization problems by learning correct short-term decisions from previous experience. It has been shown that reward shaping, which provides simpler and easier reward functions in place of the actual environmental reward, is an effective way to guide and accelerate reinforcement learning. However, building a shaping reward requires either domain knowledge or demonstrations from an optimal policy, both of which involve costly participation of human experts. This work investigates whether it is possible to automatically learn a better shaping reward along with the RL process. RL algorithms commonly sample many trajectories throughout the learning process. These passive samples, though containing many failed attempts, may provide useful information for building a shaping reward function. A policy-invariance condition for reward shaping is introduced as a more effective way to handle noisy examples, followed by the RFPotential approach for learning a shaping reward from massive examples efficiently. Empirical studies on various RL algorithms and domains show that RFPotential can accelerate the RL process. © Copyright 2013, Institute of Software, the Chinese Academy of Sciences.
Pages: 2667-2675
Number of pages: 8
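For illustration only, the sketch below shows potential-based reward shaping, the standard policy-invariance construction the abstract refers to: the shaped reward adds F(s, s') = γΦ(s') − Φ(s) to the environmental reward, which provably leaves the optimal policy unchanged. The chain environment, the hand-crafted potential Φ, the Q-learning loop, and all hyper-parameters are assumptions made here for demonstration; the paper's RFPotential approach instead learns the potential from passively collected trajectories.

Sketch (Python):

# Minimal sketch of potential-based reward shaping with tabular Q-learning.
# The chain environment, the hand-crafted potential, and all hyper-parameters
# are illustrative assumptions; this is NOT the RFPotential algorithm itself.
import random

N_STATES = 10          # states 0..9, goal at state 9
GAMMA = 0.95
ALPHA = 0.1
EPSILON = 0.1
ACTIONS = (-1, +1)     # move left / right

def step(state, action):
    """Environment transition: sparse reward, +1 only at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

def potential(state):
    """Assumed potential Phi(s); RFPotential would instead learn this
    from passively collected trajectories."""
    return -(N_STATES - 1 - state)   # closer to the goal => higher potential

def shaping(state, next_state):
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s).
    Adding F to the reward leaves the optimal policy unchanged."""
    return GAMMA * potential(next_state) - potential(state)

def q_learning(episodes=200, use_shaping=True):
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    steps_per_episode = []
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 200:
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            if use_shaping:
                reward += shaping(state, next_state)
            target = reward + (0.0 if done else
                               GAMMA * max(q[(next_state, a)] for a in ACTIONS))
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state, steps = next_state, steps + 1
        steps_per_episode.append(steps)
    return steps_per_episode

if __name__ == "__main__":
    random.seed(0)
    plain = q_learning(use_shaping=False)
    shaped = q_learning(use_shaping=True)
    print("avg steps, first 20 episodes, no shaping:", sum(plain[:20]) / 20)
    print("avg steps, first 20 episodes, shaping:   ", sum(shaped[:20]) / 20)

With a well-aligned potential, the shaped agent typically reaches the goal in far fewer steps during early episodes, which mirrors the acceleration effect the abstract describes.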