Reducing the Planning Horizon Through Reinforcement Learning

Cited by: 0
Authors
Dunbar, Logan [1 ]
Rosman, Benjamin [2 ]
Cohn, Anthony G. [1 ,3 ,4 ,5 ,6 ]
Leonetti, Matteo [7 ]
Affiliations
[1] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
[2] Univ Witwatersrand, Johannesburg, South Africa
[3] Tongji Univ, Shanghai, Peoples R China
[4] Alan Turing Inst, London, England
[5] Qingdao Univ Sci & Technol, Qingdao, Peoples R China
[6] Shandong Univ, Jinan, Peoples R China
[7] Kings Coll London, Dept Informat, London, England
Source
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2022, PT IV | 2023, Vol. 13716
Keywords
Planning; Planning horizon; Reinforcement learning;
DOI
10.1007/978-3-031-26412-2_5
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Planning is a computationally expensive process, which can limit the reactivity of autonomous agents. Planning problems are usually solved in isolation, independently of similar, previously solved problems. The depth of search that a planner requires to find a solution, known as the planning horizon, is a critical factor when integrating planners into reactive agents. We consider the case of an agent repeatedly carrying out a task from different initial states. We propose a combination of classical planning and model-free reinforcement learning to reduce the planning horizon over time. Control is smoothly transferred from the planner to the model-free policy as the agent compiles the planner's policy into a value function. Local exploration of the model-free policy allows the agent to adapt to the environment and eventually overcome model inaccuracies. We evaluate the efficacy of our framework on symbolic PDDL domains and a stochastic grid world environment, and show that it significantly reduces the planning horizon while also compensating for model inaccuracies.
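Since the abstract only sketches the mechanism, a toy illustration may help place the moving parts. The following is a minimal, hypothetical sketch of the general idea, not the authors' algorithm: it assumes a deterministic grid world, uses breadth-first search as a stand-in classical planner, tabular Q-learning as the model-free learner, and a visit-count threshold as the trust test that hands control over; every name here (`bfs_plan`, `TRUST`, the reward values) is illustrative rather than taken from the paper.

```python
# Hypothetical sketch of planner-to-policy handover; not the paper's algorithm.
from collections import defaultdict, deque
import random

SIZE, GOAL = 5, (4, 4)
ACTIONS = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(state, action):
    """Deterministic grid dynamics: move one cell, clipped to the grid."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), SIZE - 1),
           min(max(state[1] + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else -0.01), nxt == GOAL

def bfs_plan(start, trusted):
    """Stand-in classical planner: search only until the goal OR any
    'trusted' state where the learned policy can take over, so the
    search depth (the planning horizon) shrinks as trust spreads."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        s, path = frontier.popleft()
        if s == GOAL or s in trusted:
            return path
        for a in ACTIONS:
            nxt = step(s, a)[0]
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return []

Q = defaultdict(float)       # tabular action values, keyed by (state, action)
visits = defaultdict(int)    # visit counts double as a crude trust test
ALPHA, GAMMA, TRUST = 0.5, 0.95, 5

def greedy(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(300):
    state = GOAL
    while state == GOAL:     # random non-goal start each episode
        state = (random.randrange(SIZE), random.randrange(SIZE))
    plan, done = [], False
    while not done:
        trusted = {s for s, n in visits.items() if n >= TRUST}
        if state in trusted:
            # Model-free control, with local exploration around the policy.
            plan = []
            action = random.choice(list(ACTIONS)) if random.random() < 0.1 else greedy(state)
        else:
            # Planner control; replan only when the current plan runs out.
            if not plan:
                plan = bfs_plan(state, trusted) or [random.choice(list(ACTIONS))]
            action = plan.pop(0)
        nxt, reward, done = step(state, action)
        # TD update: compiles whichever controller acted into the value function.
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        visits[state] += 1
        state = nxt
```

In this toy version, the planning horizon is the length of the path returned by `bfs_plan`, which shrinks as more states pass the trust test and the handover to the learned policy happens earlier; the paper's framework additionally covers symbolic PDDL domains and stochastic dynamics, which this sketch does not.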
Pages: 68-83
Page count: 16