Exploration Using Without-Replacement Sampling of Actions Is Sometimes Inferior

Cited by: 2
Authors
Carden, Stephen W. [1 ]
Walker, S. Dalton [2 ]
Affiliations
[1] Georgia Southern Univ, Dept Math Sci, Statesboro, GA 30460 USA
[2] Air Force Mat Command, Robins Air Force Base, GA 31098 USA
Source
MACHINE LEARNING AND KNOWLEDGE EXTRACTION | 2019, Vol. 1, No. 2
Keywords
count-based exploration; without-replacement sampling; stochastic shortest path; reinforcement learning; Markov decision processes; convergence
DOI
10.3390/make1020041
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In many statistical and machine learning applications, without-replacement sampling is considered superior to with-replacement sampling. In some cases this has been proven, and in others the heuristic is so intuitively attractive that it is taken for granted. In reinforcement learning, many count-based exploration strategies are justified by appealing to this heuristic. This paper details the counter-intuitive finding that, when the quality of an exploration strategy is measured by the stochastic shortest path to a goal state, there is a class of processes for which action selection based on without-replacement sampling of actions can be worse than with-replacement sampling. Specifically, the expected time until a specified goal state is first reached can be provably larger under without-replacement sampling. Numerical experiments characterize the frequency and severity of this inferiority.
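To make the two action-sampling schemes in the abstract concrete, the following is a minimal simulation sketch. It assumes a hypothetical four-state chain MDP with a single goal state and two stochastic actions per state (the chain, its transition probabilities, and all names here are illustrative assumptions, not taken from the paper), and it estimates the expected time until the goal is first reached under with-replacement action selection (an action drawn uniformly on every visit) versus without-replacement action selection (a state's action set is exhausted before any action is repeated). The toy chain only illustrates the mechanics and the hitting-time estimate; it is not one of the counterexample processes the paper constructs.

```python
import random

# Hypothetical toy chain MDP (illustration only, not from the paper):
# states 0..3, goal state 3, two stochastic actions per state.
GOAL = 3
ACTIONS = [0, 1]

def step(state, action, rng):
    """Stochastic transition: action 0 usually advances, action 1 usually does not."""
    if action == 0:
        return min(state + 1, GOAL) if rng.random() < 0.7 else state
    return min(state + 1, GOAL) if rng.random() < 0.3 else max(state - 1, 0)

def hitting_time(scheme, rng, max_steps=10_000):
    """Number of steps until the goal state is first reached from state 0."""
    state, t = 0, 0
    pools = {}  # per-state pool of not-yet-tried actions (used only without replacement)
    while state != GOAL and t < max_steps:
        if scheme == "with":
            # With-replacement: every action is available on every visit.
            action = rng.choice(ACTIONS)
        else:
            # Without-replacement: exhaust a state's action set before repeating any.
            pool = pools.setdefault(state, [])
            if not pool:
                pool.extend(ACTIONS)
                rng.shuffle(pool)
            action = pool.pop()
        state = step(state, action, rng)
        t += 1
    return t

rng = random.Random(0)
for scheme in ("with", "without"):
    times = [hitting_time(scheme, rng) for _ in range(5_000)]
    print(f"{scheme}-replacement: mean hitting time = {sum(times) / len(times):.2f}")
```

On a process in the counterexample class identified by the paper, the analogous comparison would show a strictly larger mean hitting time under the without-replacement strategy.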
Pages: 698-714
Page count: 17