Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Cited by: 0
Authors
Ma, Xiaoteng [1 ]
Ma, Shuai [2 ]
Xia, Li [2 ]
Zhao, Qianchuan [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing 100086, Peoples R China
[2] Sun Yat Sen Univ, Sch Business, Guangzhou 510275, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
MARKOV DECISION-PROCESSES; PORTFOLIO SELECTION; VARIANCE; MODEL;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keeping risk under control is often more crucial than maximizing expected reward in real-world decision-making settings such as finance, robotics, and autonomous driving. The most natural choice of risk measure is variance, yet it penalizes upside volatility as much as the downside. In contrast, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is better suited to risk-averse purposes. This paper optimizes the mean-semivariance (MSV) criterion in reinforcement learning with respect to the steady-state reward distribution. Because semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods cannot be applied to MSV problems directly. To tackle this challenge, we resort to Perturbation Analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct diverse experiments, ranging from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of the proposed methods.
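The downside semivariance contrasted with variance in the abstract can be sketched in a few lines. The following minimal NumPy example is an illustrative sample-based estimator, not code from the paper; the function name and the risk-aversion weight `lam` are assumptions made here for exposition:

```python
import numpy as np

def mean_semivariance(rewards, lam=1.0):
    """Sample-based mean-semivariance objective: mean(r) - lam * semivariance(r).

    Semivariance averages the squared deviations only for samples BELOW the
    mean, so upside volatility is not penalized, unlike plain variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()
    downside = np.minimum(rewards - mu, 0.0)  # keep only below-mean deviations
    semivariance = np.mean(downside ** 2)
    return mu - lam * semivariance

# Two reward streams with identical variance: one has an upside spike,
# the other a downside spike. Variance cannot tell them apart, but the
# mean-semivariance objective prefers the upside-spike stream.
upside = [0.0, 0.0, 0.0, 3.0]
downside_stream = [0.0, 0.0, 0.0, -3.0]
print(mean_semivariance(upside), mean_semivariance(downside_stream))
```

This asymmetry is exactly why the abstract argues semivariance fits risk-averse purposes better than variance, which penalizes the upside spike just as heavily.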
Pages: 569-595 (27 pages)