Deterministic policy gradient algorithms for semi-Markov decision processes

Cited by: 4
Authors
Hosseinloo, Ashkan Haji [1]
Dahleh, Munther A. [1]
Affiliations
[1] MIT, Lab Informat & Decis Syst, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
average reward; deterministic policy; policy gradient theorem; reinforcement learning; SMDP
DOI
10.1002/int.22709
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A large class of sequential decision-making problems under uncertainty, with broad applications ranging from preventive maintenance to event-triggered control, can be modeled in the framework of semi-Markov decision processes (SMDPs). Unlike Markov decision processes (MDPs), SMDPs are underexplored in the online and reinforcement learning (RL) settings. In this paper, we extend the well-known deterministic policy gradient (DPG) theorem for MDPs to SMDPs under the average-reward criterion. Existing stochastic policy gradient methods not only require, in general, a large number of samples for training, but also suffer from high variance in gradient estimation when applied to problems with a deterministic optimal policy. Our DPG method can potentially remedy both issues. On the basis of this method, and depending on the choice of critic, different actor-critic algorithms can readily be developed in the RL setup. We present two example actor-critic algorithms. Both employ our policy gradient theorem for their actors but use different critics: one uses a simple SARSA update, while the other uses the same on-policy update with compatible function approximators. We demonstrate the efficacy of our method both mathematically and via simulations.
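To make the actor-critic construction described in the abstract concrete, the following is a minimal Python sketch of a deterministic actor-critic for an average-reward SMDP with a SARSA-style critic and linear function approximation. It illustrates the general scheme only and is not the authors' exact algorithm; the transition interface (next state, accrued reward r, sojourn time tau), the bilinear critic features, and all step sizes are assumptions made for illustration.

```python
import numpy as np

# A minimal sketch (an assumption, not the paper's exact algorithm) of a
# deterministic actor-critic for an average-reward SMDP with linear
# function approximation and a SARSA-style critic.
class SMDPDeterministicActorCritic:
    def __init__(self, state_dim, action_dim,
                 alpha_theta=1e-3, alpha_w=1e-2, alpha_rho=1e-2):
        self.theta = np.zeros((action_dim, state_dim))  # actor: mu(s) = theta @ s
        self.W = np.zeros((action_dim, state_dim))      # critic: Q(s, a) = a^T W s
        self.rho = 0.0                                  # average-reward-rate estimate
        self.a_th, self.a_w, self.a_rho = alpha_theta, alpha_w, alpha_rho

    def act(self, s):
        return self.theta @ s                           # deterministic policy mu_theta(s)

    def q(self, s, a):
        return a @ self.W @ s                           # bilinear critic

    def update(self, s, a, r, tau, s_next):
        """One SMDP transition: reward r accrued over sojourn time tau."""
        a_next = self.act(s_next)
        # SARSA-style TD error; the reward rate rho is charged per unit of
        # sojourn time, as is standard for average-reward SMDPs.
        delta = r - self.rho * tau + self.q(s_next, a_next) - self.q(s, a)
        self.W += self.a_w * delta * np.outer(a, s)     # critic step
        self.rho += self.a_rho * delta                  # reward-rate step
        # Deterministic policy gradient step:
        #   grad_theta J ~ grad_theta mu(s) . grad_a Q(s, a)|_{a = mu(s)}
        grad_a_q = self.W @ s                           # dQ/da for the bilinear critic
        self.theta += self.a_th * np.outer(grad_a_q, s)
```

A training loop would repeatedly call act, advance the (hypothetical) environment, and call update with the observed reward and sojourn time; replacing the bilinear critic with features built from grad_theta mu(s) would mirror the compatible-function-approximation variant mentioned in the abstract.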
Pages: 4008-4019
Page count: 12