Finding optimal memoryless policies of POMDPs under the expected average reward criterion

Cited by: 19
Authors
Li, Yanjie [1 ]
Yin, Baoqun [2 ]
Xi, Hongsheng [2 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad School, Div Control & Mechatron Engn, Shenzhen 518055, Peoples R China
[2] Univ Sci & Technol China, Dept Automat, Hefei 230026, Anhui, Peoples R China
Funding
National High-Tech R&D Program of China (863 Program); US National Science Foundation; National Natural Science Foundation of China;
Keywords
POMDPs; Performance difference; Policy iteration with step sizes; Correlated actions; Memoryless policy; EVENT-BASED OPTIMIZATION; INFINITE-HORIZON; MARKOV; POTENTIALS; ALGORITHMS; ITERATION;
DOI
10.1016/j.ejor.2010.12.014
Chinese Library Classification
C93 [Management];
Discipline classification codes
12 ; 1201 ; 1202 ; 120202 ;
Abstract
In this paper, partially observable Markov decision processes (POMDPs) with discrete state and action spaces under the average reward criterion are considered from a recently developed sensitivity point of view. By analyzing the average-reward performance difference formula, we propose a policy iteration algorithm with step sizes that obtains an optimal or locally optimal memoryless policy. The algorithm improves the policy along the same direction as policy iteration does, and suitable step sizes guarantee its convergence. Moreover, the algorithm can be applied to Markov decision processes (MDPs) with correlated actions. Two numerical examples illustrate the applicability of the algorithm. (C) 2010 Elsevier B.V. All rights reserved.
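The abstract's core idea can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: it assumes a small randomly generated POMDP (the sizes, variable names, and step size are all invented for this example), computes the average reward and potentials of the Markov chain induced by a memoryless policy via the Poisson equation, and then mixes the current policy toward the observation-level greedy policy with a step size, in the spirit of policy iteration with step sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy POMDP (all sizes and names are illustrative assumptions):
# 3 hidden states, 2 observations, 2 actions.
S, O, A = 3, 2, 2
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, s'] transition probs
r = rng.uniform(size=(S, A))                 # r[s, a] rewards in [0, 1]
Z = rng.dirichlet(np.ones(O), size=S)        # Z[s, o] observation probs

def average_reward_and_potentials(pi):
    """Stationary distribution, average reward eta, and potentials g
    of the Markov chain induced by memoryless policy pi[o, a]."""
    pa = Z @ pi                                 # action probs per state, (S, A)
    Ppi = np.einsum('sa,ast->st', pa, P)        # induced transition matrix
    rpi = np.sum(pa * r, axis=1)                # induced reward vector
    # Stationary distribution: left eigenvector of Ppi for eigenvalue 1.
    w, v = np.linalg.eig(Ppi.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1))])
    d = d / d.sum()
    eta = d @ rpi
    # Poisson equation (I - Ppi) g = rpi - eta, pinned by d @ g = 0.
    Aeq = np.vstack([np.eye(S) - Ppi, d])
    b = np.append(rpi - eta, 0.0)
    g, *_ = np.linalg.lstsq(Aeq, b, rcond=None)
    return d, eta, g

def improve(pi, alpha):
    """One step of policy improvement with step size alpha."""
    d, eta, g = average_reward_and_potentials(pi)
    Q = r - eta + np.einsum('ast,t->sa', P, g)  # state-action Q-values
    # Observation-level Q, weighted by the stationary state distribution.
    Qo = (d[:, None] * Z).T @ Q                 # (O, A)
    greedy = np.zeros_like(pi)
    greedy[np.arange(O), Qo.argmax(axis=1)] = 1.0
    # Move toward the greedy policy by a step of size alpha.
    return (1 - alpha) * pi + alpha * greedy, eta

pi = np.full((O, A), 1.0 / A)                   # start from the uniform policy
for _ in range(50):
    pi, eta = improve(pi, alpha=0.3)
```

Taking a partial step toward the greedy policy (rather than jumping to it, as plain policy iteration would) is what allows convergence guarantees for memoryless POMDP policies, where a full greedy update can oscillate because the stationary distribution shifts with the policy.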
Pages: 556-567
Number of pages: 12
Related references
31 records in total
[1] [Anonymous], 2000, INTRO MARKOV DECISIO
[2] [Anonymous], REINFORCEMENT LEARNI
[3] Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(5):834-846.
[4] Baxter, J.; Bartlett, P.L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001, 15:319-350.
[5] Baxter, J.; Bartlett, P.L.; Weaver, L. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 2001, 15:351-381.
[6] Bernstein, D.S., 2005, 19th International Joint Conference on Artificial Intelligence (IJCAI-05), p. 1287.
[7] Bertsekas, D., 1996, Optimization and Neural Computation Series, V27.
[8] Cao, Xi-Ren; Zhang, Junyu. Event-based optimization of Markov systems. IEEE Transactions on Automatic Control, 2008, 53(4):1076-1082.
[9] Cao, Xi-Ren; Zhang, Junyu. The nth-order bias optimality for multichain Markov decision processes. IEEE Transactions on Automatic Control, 2008, 53(2):496-508.
[10] Cao, Xi-Ren, 2007, STOCHASTIC LEARNING