ACIS: An Improved Actor-Critic Method for POMDPs with Internal State

Cited by: 2
Authors
Xu, Dan [1 ]
Liu, Quan [1 ]
Affiliations
[1] Soochow Univ, Comp Sci & Technol, Suzhou, Peoples R China
Source
2015 IEEE 27TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2015) | 2015
Keywords
Reinforcement Learning; POMDP; Internal State; Policy Gradient; Actor-Critic;
DOI
10.1109/ICTAI.2015.63
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Partially observable Markov decision processes (POMDPs) provide a rich mathematical model for sequential decision making in partially observable and stochastic environments. Model-free methods use an internal state as a substitute for the belief state, which in model-based techniques serves as a sufficient statistic of the entire past action-observation history. A main drawback of previous model-free techniques, such as direct policy gradient methods, is that their solutions often suffer from high variance in the gradient estimate. This paper proposes a novel algorithm, Actor-Critic with Internal State (ACIS), to reduce this variance within the policy gradient framework. ACIS derives its power from the actor-critic framework: the actor updates the parameters of the policy function, while the critic uses temporal-difference learning to evaluate the current policy. Empirically, ACIS outperforms state-of-the-art model-free methods such as IState-GPOMDP in terms of both variance and final reward on the Load-Unload and Robot Navigation problems.
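As a rough illustration of the kind of agent the abstract describes, the sketch below implements a generic tabular actor-critic whose policy also chooses a finite internal (memory) state, run on a toy Load-Unload corridor. It is not the ACIS algorithm from the paper; the environment, the joint action/memory parameterization, and all constants (corridor length, learning rates, discount factor) are assumptions made purely for illustration.

```python
# Minimal, illustrative actor-critic with a finite internal (memory) state on a
# toy Load-Unload corridor. NOT the ACIS algorithm from the paper; it only
# sketches the idea of pairing a learned policy (actor) with a TD-based value
# estimate (critic) when the agent must carry its own memory to handle
# partial observability. All names and constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

N = 5                 # corridor length (positions 0 .. N-1)
OBS = 3               # 0: left end, 1: right end, 2: middle (aliased)
MEM = 2               # internal-state values the agent can remember
ACT = 2               # 0: move left, 1: move right
GAMMA, ALPHA_V, ALPHA_PI = 0.95, 0.1, 0.05

def observe(pos):
    return 0 if pos == 0 else (1 if pos == N - 1 else 2)

def step(pos, loaded, action):
    """One environment step; reward 1 for unloading at the left end."""
    pos = max(0, pos - 1) if action == 0 else min(N - 1, pos + 1)
    reward = 0.0
    if pos == N - 1:
        loaded = True                   # cargo is picked up at the right end
    elif pos == 0 and loaded:
        loaded, reward = False, 1.0     # cargo is delivered at the left end
    return pos, loaded, reward

# Actor: preferences over joint (action, next memory) choices, conditioned on
# the current (observation, memory) pair. Critic: value of (observation, memory).
theta = np.zeros((OBS, MEM, ACT * MEM))
V = np.zeros((OBS, MEM))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    pos, loaded, mem = 0, False, 0
    obs = observe(pos)
    for t in range(50):
        probs = softmax(theta[obs, mem])
        choice = rng.choice(ACT * MEM, p=probs)
        action, next_mem = divmod(int(choice), MEM)

        pos, loaded, reward = step(pos, loaded, action)
        next_obs = observe(pos)

        # Critic: one-step TD error for the augmented (observation, memory) state.
        td = reward + GAMMA * V[next_obs, next_mem] - V[obs, mem]
        V[obs, mem] += ALPHA_V * td

        # Actor: policy-gradient step, using the TD error as the advantage signal.
        grad = -probs
        grad[choice] += 1.0
        theta[obs, mem] += ALPHA_PI * td * grad

        obs, mem = next_obs, next_mem

print("Learned values V(obs, mem):")
print(np.round(V, 2))
```

The TD error here plays the role of a low-variance learning signal for the actor, which is the general mechanism the abstract credits for reducing the variance of plain policy-gradient estimates.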
Pages: 369-376
Number of pages: 8
References
16 in total
[1] Aberdeen D., 2002, Proceedings of the International Conference on Machine Learning, p. 3.
[2] Aberdeen D. A., 2003, Thesis, Australian National University.
[3] Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251-276.
[4] [Anonymous], 2010, Scholarpedia.
[5] Sutton R. S., Barto A. G., 1998, Reinforcement Learning: An Introduction, MIT Press.
[6] Baxter J., Bartlett P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001, 15: 319-350.
[7] Busoniu L., 2010, Automation and Control Engineering Series, p. 1. DOI: 10.1201/9781439821091-f.
[8] Degris T., 2012, Proceedings of the 29th International Conference on Machine Learning.
[9] Kaelbling L. P., Littman M. L., Cassandra A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998, 101(1-2): 99-134.
[10] Konda V. R., Tsitsiklis J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 2003, 42(4): 1143-1166.