Stochastic optimization of controlled partially observable Markov decision processes

Citations: 0
Authors
Bartlett, PL [1 ]
Baxter, J [1 ]
Affiliation
[1] Australian Natl Univ, Res Sch Info Sci & Eng, Canberra, ACT 0200, Australia
Source
PROCEEDINGS OF THE 39TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5 | 2000
Keywords
DOI
None available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
We introduce an on-line algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
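The abstract describes a policy-gradient estimator driven by a single sample path, a discount-style parameter β ∈ [0, 1), and no access to the hidden state. The sketch below illustrates that style of estimator: an eligibility trace weighted by β accumulates log-policy gradients computed from observations only, and a running average of reward-weighted traces forms the gradient estimate. The two-state POMDP, observation model, and softmax policy here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy POMDP (assumed for illustration): 2 hidden states, 2 observations,
# 2 actions. P[a][s, s'] are transition probabilities, r[s] the reward,
# O[s, y] the observation probabilities.
n_states, n_obs, n_actions = 2, 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
r = np.array([1.0, -1.0])
O = np.array([[0.8, 0.2], [0.3, 0.7]])

def policy(theta, y):
    """Softmax policy over actions, conditioned on the observation only."""
    logits = theta[y]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def gradient_estimate(theta, beta=0.9, T=50_000):
    """Estimate the average-reward gradient from one sample path.

    beta in [0, 1) trades bias against variance: larger beta means a
    longer-memory eligibility trace (less bias, more variance).
    """
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running gradient estimate
    s = 0                         # hidden state, never read by the policy
    for t in range(T):
        y = rng.choice(n_obs, p=O[s])
        p = policy(theta, y)
        a = rng.choice(n_actions, p=p)
        # grad of log mu(a | theta, y): one-hot minus softmax probabilities
        grad_log = np.zeros_like(theta)
        grad_log[y] = -p
        grad_log[y, a] += 1.0
        z = beta * z + grad_log
        s = rng.choice(n_states, p=P[a][s])
        delta += (r[s] * z - delta) / (t + 1)
    return delta

theta = np.zeros((n_obs, n_actions))
grad = gradient_estimate(theta)
```

The estimate `grad` can then drive stochastic gradient ascent on θ; note that the update loop touches only observations, actions, and rewards, never the hidden state `s` directly.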
Pages: 124-129
Page count: 6