Steady State Analysis of Episodic Reinforcement Learning

Cited by: 0
Authors
Huang Bojun [1 ]
Affiliations
[1] Rakuten Inst Technol, Tokyo, Japan
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020, Vol. 33
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input indeed converges to the steady-state distribution in essentially all episodic learning processes. This observation supports a mindset that interestingly reverses conventional wisdom: while the existence of unique steady states was often presumed in continual learning but considered less relevant in episodic learning, it turns out that their existence is guaranteed for the latter. Based on this insight, the paper unifies episodic and continual RL around several important concepts that have been treated separately in these two RL formalisms. Practically, the existence of a unique and approachable steady state enables a general way to collect data in episodic RL tasks, which the paper applies to policy gradient algorithms as a demonstration, based on a new steady-state policy gradient theorem. Finally, the paper also proposes and experimentally validates a perturbation method that facilitates rapid steady-state convergence in real-world RL tasks.
Pages: 12
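
For orientation on the abstract's central claim, the following is a minimal sketch in Python (not the paper's code or construction): it builds a small toy episodic MDP, rewires the terminal state back to the initial-state distribution under a fixed behavior policy, computes the stationary distribution of the resulting chain as its principal left eigenvector, and checks that the empirical state distribution of one long rollout with resets approaches it. The toy kernel, the uniform behavior policy, and all names below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
terminal = n_states - 1                      # state 4 ends an episode
init_dist = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Random transition kernel P[s, a, s'] and a fixed uniform behavior policy
# (both are arbitrary illustrative choices, not taken from the paper).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
policy = np.full((n_states, n_actions), 1.0 / n_actions)

# State-to-state kernel under the policy, with the terminal state rewired to
# the initial distribution -- the "reset" that defines the episodic chain.
P_pi = np.einsum("sa,sat->st", policy, P)
P_pi[terminal] = init_dist

# Steady state: the (normalized) left eigenvector of P_pi for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
rho = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
rho = rho / rho.sum()

# Empirical marginal state distribution of one long rollout with resets.
counts = np.zeros(n_states)
s = rng.choice(n_states, p=init_dist)
for _ in range(200_000):
    counts[s] += 1
    a = rng.choice(n_actions, p=policy[s])
    s = rng.choice(n_states, p=init_dist if s == terminal else P[s, a])

print("steady-state distribution:", np.round(rho, 3))
print("empirical distribution   :", np.round(counts / counts.sum(), 3))

For a long enough rollout the two printed vectors should agree to within sampling noise; the steady-state policy gradient theorem and the perturbation method mentioned in the abstract are developed in the paper itself and are not reproduced here.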