Average-Reward Reinforcement Learning with Trust Region Methods

Cited by: 0

Authors:

Ma, Xiaoteng ^{[1]}

Tang, Xiaohang ^{[2]}

Xia, Li ^{[3]}

Yang, Jun ^{[1]}

Zhao, Qianchuan ^{[1]}

Affiliations:

[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China

[2] UCL, Dept Stat Sci, London, England

[3] Sun Yat Sen Univ, Business Sch, Guangzhou, Peoples R China

Source:

PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021 | 2021

Keywords:

GRADIENT METHODS;

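DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: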

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks, such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory covering both the discounted and average criteria. Under the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.
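The contrast the abstract draws between the discounted and long-run average criteria can be illustrated with a small sketch. This is a toy illustration only, not the APO algorithm or anything taken from the paper: the function names `discounted_return` and `average_reward`, the two reward sequences, and the discount factor 0.9 are all assumptions chosen for the example, and a finite sequence stands in for the long-run limit.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (hypothetical helper)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def average_reward(rewards):
    """Mean reward per step over a finite reward sequence (hypothetical helper)."""
    return sum(rewards) / len(rewards)


# Two illustrative sequences with the same total reward:
# one collects it in the first 10 steps, the other in the last 10.
early = [1.0] * 10 + [0.0] * 90
late = [0.0] * 90 + [1.0] * 10

gamma = 0.9  # assumed discount factor, for illustration only

print("average   :", average_reward(early), average_reward(late))
print("discounted:", discounted_return(early, gamma), discounted_return(late, gamma))
```

Under these assumptions, the average criterion scores both sequences identically, while the discounted criterion strongly favors the sequence that collects its reward early, which is the sense in which the average criterion treats future rewards equally.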

Citation

Pages: 2797-2803

Page count: 7

30 references in total

[11] Kakade S, 2001, LECT NOTES ARTIF INT, V2111, P605

[12] Kakade Sham, 2002, P 19 INT C MACHINE L, P267

[13] Konda VR, 2000, ADV NEUR IN, V12, P1008

[14] Lowe R, 2017, ADV NEUR IN, V30

[15] Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results [J]. MACHINE LEARNING, 1996, 22 (1-3): 159-195

[16] Marbach, P; Tsitsiklis, JN. Approximate gradient methods in policy-space optimization of Markov reward processes [J]. DISCRETE EVENT DYNAMIC SYSTEMS-THEORY AND APPLICATIONS, 2003, 13 (1-2): 111-148

[17] Marcin Andrychowicz, 2020, ARXIV200605990

[18] Puterman M.L., 1994, Markov decision processes: discrete stochastic dynamic programming

[19] Schulman, 2017, ARXIV

[20] Schulman J, 2015, PR MACH LEARN RES, V37, P1889
