Reducing the Planning Horizon Through Reinforcement Learning

Cited by: 0
Authors
Dunbar, Logan [1 ]
Rosman, Benjamin [2 ]
Cohn, Anthony G. [1 ,3 ,4 ,5 ,6 ]
Leonetti, Matteo [7 ]
Affiliations
[1] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
[2] Univ Witwatersrand, Johannesburg, South Africa
[3] Tongji Univ, Shanghai, Peoples R China
[4] Alan Turing Inst, London, England
[5] Qingdao Univ Sci & Technol, Qingdao, Peoples R China
[6] Shandong Univ, Jinan, Peoples R China
[7] Kings Coll London, Dept Informat, London, England
Source
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2022, PT IV | 2023, Vol. 13716
Keywords
Planning; Planning horizon; Reinforcement learning;
DOI
10.1007/978-3-031-26412-2_5
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Planning is a computationally expensive process, which can limit the reactivity of autonomous agents. Planning problems are usually solved in isolation, independently of similar, previously solved problems. The depth of search that a planner requires to find a solution, known as the planning horizon, is a critical factor when integrating planners into reactive agents. We consider the case of an agent repeatedly carrying out a task from different initial states. We propose a combination of classical planning and model-free reinforcement learning to reduce the planning horizon over time. Control is smoothly transferred from the planner to the model-free policy as the agent compiles the planner's policy into a value function. Local exploration of the model-free policy allows the agent to adapt to the environment and eventually overcome model inaccuracies. We evaluate the efficacy of our framework on symbolic PDDL domains and a stochastic grid world environment, and show that it significantly reduces the planning horizon while also compensating for model inaccuracies.
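Since the abstract only sketches the mechanism, a toy illustration may help place the moving parts. The following is a minimal, hypothetical sketch of the general idea, not the authors' algorithm: it assumes a deterministic grid world, uses breadth-first search as a stand-in classical planner, tabular Q-learning as the model-free learner, and a visit-count threshold as the trust test that hands control over; every name here (`bfs_plan`, `TRUST`, the reward values) is illustrative rather than taken from the paper.

```python
# Hypothetical sketch of planner-to-policy handover; not the paper's algorithm.
from collections import defaultdict, deque
import random

SIZE, GOAL = 5, (4, 4)
ACTIONS = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(state, action):
    """Deterministic grid dynamics: move one cell, clipped to the grid."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), SIZE - 1),
           min(max(state[1] + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else -0.01), nxt == GOAL

def bfs_plan(start, trusted):
    """Stand-in classical planner: search only until the goal OR any
    'trusted' state where the learned policy can take over, so the
    search depth (the planning horizon) shrinks as trust spreads."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        s, path = frontier.popleft()
        if s == GOAL or s in trusted:
            return path
        for a in ACTIONS:
            nxt = step(s, a)[0]
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return []

Q = defaultdict(float)       # tabular action values, keyed by (state, action)
visits = defaultdict(int)    # visit counts double as a crude trust test
ALPHA, GAMMA, TRUST = 0.5, 0.95, 5

def greedy(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(300):
    state = GOAL
    while state == GOAL:     # random non-goal start each episode
        state = (random.randrange(SIZE), random.randrange(SIZE))
    plan, done = [], False
    while not done:
        trusted = {s for s, n in visits.items() if n >= TRUST}
        if state in trusted:
            # Model-free control, with local exploration around the policy.
            plan = []
            action = random.choice(list(ACTIONS)) if random.random() < 0.1 else greedy(state)
        else:
            # Planner control; replan only when the current plan runs out.
            if not plan:
                plan = bfs_plan(state, trusted) or [random.choice(list(ACTIONS))]
            action = plan.pop(0)
        nxt, reward, done = step(state, action)
        # TD update: compiles whichever controller acted into the value function.
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        visits[state] += 1
        state = nxt
```

In this toy version, the planning horizon is the length of the path returned by `bfs_plan`, which shrinks as more states pass the trust test and the handover to the learned policy happens earlier; the paper's framework additionally covers symbolic PDDL domains and stochastic dynamics, which this sketch does not.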
Pages: 68-83
Page count: 16