A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning

Cited by: 0
Authors
Hu, Xiaoyan [1 ]
Leung, Ho-Fung [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 206 | 2023 / Vol. 206
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study the regret for risk-sensitive reinforcement learning (RL) with the exponential utility in the episodic MDP. Recent works establish both a lower bound $\Omega\big((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|\big)$ and the best known upper bound $\tilde{O}\big((e^{|\beta|H}-1)\sqrt{H^2 SAT}/|\beta|\big)$, where $H$ is the length of the episode, $S$ the size of the state space, $A$ the size of the action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and lower bounds is exponential and hence unsatisfactory. In this paper, we show that a variant of the UCB-ADVANTAGE algorithm reduces the best previously known bound by a factor of $\sqrt{H}$ in any arbitrary MDP. To further sharpen the regret bound, we introduce a new mechanism of regret analysis and derive from the algorithm a problem-dependent regret bound that requires no prior knowledge of the MDP. This bound is much tighter in MDPs with special structures. In particular, we show that a regret matching the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, improving on the best previously known bound by an exponential factor. Further, we derive a novel information-theoretic lower bound of $\Omega\big(\max_{h\in[H]} c^*_{v,h+1}\sqrt{SAT}/|\beta|\big)$, where $\max_{h\in[H]} c^*_{v,h+1}$ is a problem-dependent statistic. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c^*_{v,h+1}$.
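For context, the LaTeX sketch below records the standard exponential-utility objective typically assumed in this line of work and a rough calculation of the exponential gap between the two bounds quoted in the abstract. The notation (value function $V^{\pi}_1$, episode count $K$ with $T = KH$) is our own gloss, not taken from the paper itself.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Sketch (assumed standard setup, not quoted from the paper):
% exponential-utility value of a policy \pi for risk parameter \beta \neq 0
% over an episode of length H, and the regret over K episodes (T = KH).
\[
  V^{\pi}_{1}(s_1) \;=\; \frac{1}{\beta}\,
  \log \mathbb{E}^{\pi}\!\Big[\exp\Big(\beta \sum_{h=1}^{H} r_h\Big)\,\Big|\, s_1\Big],
  \qquad
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V^{*}_{1}(s_1^{k}) - V^{\pi_k}_{1}(s_1^{k})\Big).
\]
% Ratio of the previously best known upper bound to the lower bound:
\[
  \frac{\tilde{O}\big((e^{|\beta|H}-1)\sqrt{H^{2}SAT}/|\beta|\big)}
       {\Omega\big((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|\big)}
  \;\approx\; H\, e^{|\beta|(H+1)/2}
  \quad\text{when } |\beta| H \gtrsim 1,
\]
% i.e., the gap grows exponentially in |\beta|H, which is what the abstract
% means by "exponential and hence unsatisfactory".
\end{document}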
Pages: 27