Average-Reward Reinforcement Learning with Trust Region Methods

Cited by: 0
Authors
Ma, Xiaoteng [1 ]
Tang, Xiaohang [2 ]
Xia, Li [3 ]
Yang, Jun [1 ]
Zhao, Qianchuan [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] UCL, Dept Stat Sci, London, England
[3] Sun Yat Sen Univ, Business Sch, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021 | 2021
Keywords
GRADIENT METHODS;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem under the long-run average criterion. Firstly, we develop a unified trust region theory for both the discounted and average criteria. Under the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach under the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted on the continuous control benchmark MuJoCo. In most tasks, APO outperforms the discounted PPO, which demonstrates the effectiveness of our approach.
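For reference, the two objectives contrasted in the abstract can be written in their standard textbook form; the notation below ($\pi$, $r_t$, $\gamma$) is generic and not taken from the paper itself:

\[
\eta_\gamma(\pi) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right], \qquad 0 \le \gamma < 1,
\]
\[
\rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_\pi\!\left[ \sum_{t=0}^{N-1} r_t \right].
\]

The discounted objective $\eta_\gamma(\pi)$ weights near-term rewards more heavily, whereas the long-run average objective $\rho(\pi)$ treats rewards at all time steps equally, which is the setting this paper targets.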
Pages: 2797-2803
Number of pages: 7