Adversarial multi-armed bandit approach to two-person zero-sum Markov games

Authors
Chang, Hyeong Soo [1 ]
Fu, Michael C. [2 ]
Marcus, Steven I. [3 ]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul, South Korea
[2] Univ Maryland, Sch Business & Inst Syst Res, College Pk, MD 20742 USA
[3] Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742 USA
Source
PROCEEDINGS OF THE 46TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-14 | 2007
Funding
U.S. National Science Foundation;
Keywords
Markov game; Markov decision process; sample average approximation; sampling
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The authors recently presented a sampling-based algorithm for solving stochastic optimization problems, based on Auer et al.'s Exp3 algorithm for "adversarial multi-armed bandit problems." In particular, the Exp3-based algorithm was recursively extended to solve finite-horizon Markov decision processes (MDPs), and its finite-iteration performance was analyzed in terms of the expected bias relative to the maximum value of the "recursive sample-average-approximation (SAA)" problem induced by the sampling process of the algorithm. The upper bound on the expected bias approaches zero as the sampling size per sampled state in each stage goes to infinity, which yields convergence to the optimal value of the original MDP problem in the limit. As a sequel to that work, this paper further extends the idea to two-person zero-sum Markov games (MGs), providing a finite-iteration bound relative to the equilibrium value of the induced "recursive SAA game" problem and establishing asymptotic convergence to the true equilibrium value. Because the complexity of the recursively extended algorithm depends on the sampling parameters rather than on the size of the state space, it can be used to break the curse of dimensionality.
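As a point of reference for the abstract, the following is a minimal Python sketch of plain (non-recursive) Exp3, the adversarial bandit algorithm of Auer et al. that the paper builds on. It is not the authors' recursive MDP/MG algorithm; the names exp3, reward_fn, and gamma are illustrative choices, and rewards are assumed to lie in [0, 1].

import math
import random

def exp3(num_arms, reward_fn, num_rounds, gamma=0.1):
    """Exp3 for adversarial multi-armed bandits (Auer et al.).

    reward_fn(arm, t) must return a reward in [0, 1]; gamma mixes in
    uniform exploration. Returns the final sampling distribution over
    arms, which the recursive scheme described in the abstract would
    read off at each sampled state.
    """
    weights = [1.0] * num_arms
    for t in range(num_rounds):
        total = sum(weights)
        probs = [(1.0 - gamma) * w / total + gamma / num_arms
                 for w in weights]
        # Draw an arm from the exploration-mixed distribution.
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, t)
        # Importance-weighted estimate keeps the update unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale to avoid float overflow; probabilities are unchanged.
        m = max(weights)
        weights = [w / m for w in weights]
    total = sum(weights)
    return [(1.0 - gamma) * w / total + gamma / num_arms for w in weights]

# Toy usage: arm 1 pays more on average, so Exp3 concentrates mass on it.
dist = exp3(num_arms=2,
            reward_fn=lambda arm, t: random.random() * (0.4 if arm == 0 else 0.9),
            num_rounds=5000)
print(dist)

In the paper's setting, each sampled state of the MDP or MG runs such a bandit over the available actions, with the (estimated) downstream value serving as the reward signal, applied recursively stage by stage.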
Pages: 238-243
Number of pages: 6