Learning domain structure through probabilistic policy reuse in reinforcement learning

Cited by: 27
Authors
Fernandez, Fernando [1]
Veloso, Manuela [2]
Affiliations
[1] Univ Carlos III Madrid, Leganes, Spain
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
Probabilistic Policy Reuse; Transfer learning; Reinforcement learning; Domain structure learning
DOI
10.1007/s13748-012-0026-6
CLC classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Policy Reuse is a transfer learning approach that improves a reinforcement learner with guidance from previously learned similar policies. The method uses past policies as a probabilistic bias: the learner chooses among exploiting the policy currently being learned, exploring random unexplored actions, and exploiting past policies. In this work, we demonstrate that Policy Reuse further contributes to learning the structure of a domain. Interestingly, and almost as a side effect, Policy Reuse identifies classes of similar policies, revealing a basis of core-policies of the domain. We demonstrate theoretically that, under a set of conditions, reusing such a set of core-policies allows us to bound the minimal expected gain received while learning a new policy. In general, Policy Reuse contributes to the overall goal of lifelong reinforcement learning, as (i) it incrementally builds a policy library; (ii) it provides a mechanism to reuse past policies; and (iii) it learns an abstract domain structure in terms of core-policies of the domain.
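The probabilistic bias described above can be sketched as a single action-selection step. This is a minimal illustration, not the authors' implementation: the function name `pi_reuse_action`, the parameter names `psi` and `epsilon`, and the flat Q-value list are assumptions made for clarity; the paper's π-reuse strategy additionally decays the reuse probability within each episode.

```python
import random

def pi_reuse_action(q_values, past_action, psi, epsilon, n_actions, rng=random):
    """One action-selection step in the spirit of pi-reuse (a sketch).

    With probability psi, exploit the past policy's suggested action;
    otherwise act epsilon-greedily on the current Q-values.
    """
    if rng.random() < psi:
        # Reuse: follow the previously learned policy.
        return past_action
    if rng.random() < epsilon:
        # Explore: pick a random action.
        return rng.randrange(n_actions)
    # Exploit: greedy action under the policy currently being learned.
    return max(range(n_actions), key=lambda a: q_values[a])
```

In the paper's formulation the reuse probability is typically annealed over the steps of an episode, so early steps lean on the past policy and later steps on the new one.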
Pages: 13-27 (15 pages)