MENTOR: Guiding Hierarchical Reinforcement Learning With Human Feedback and Dynamic Distance Constraint

Times Cited: 0
Authors
Zhou, Xinglin [1 ]
Yuan, Yifu [2 ]
Yang, Shaofu [3 ]
Hao, Jianye [2 ]
Affiliations
[1] Southeast Univ, Southeast Univ Monash Univ Joint Grad Sch, Suzhou 215123, Peoples R China
[2] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[3] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2025, Vol. 9, Issue 2
Funding
National Natural Science Foundation of China;
Keywords
Training; Hafnium; Reinforcement learning; Trajectory; Manuals; Computational intelligence; Uncertainty; Space exploration; Optimization; Dynamic scheduling; Dynamic distance constraint; hierarchical reinforcement learning; reinforcement learning from human feedback;
DOI
10.1109/TETCI.2025.3529902
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hierarchical reinforcement learning (HRL) offers a promising solution for complex tasks with sparse rewards: a hierarchical framework divides a task into subgoals and completes them sequentially. However, current methods struggle to find subgoals that ensure a stable learning process. To address this issue, we propose a general hierarchical reinforcement learning framework that incorporates human feedback and dynamic distance constraints, termed MENTOR, which acts as a "mentor". Specifically, human feedback is incorporated into high-level policy learning to find better subgoals. Furthermore, we propose a Dynamic Distance Constraint (DDC) mechanism that dynamically adjusts the space of candidate subgoals, so that MENTOR generates subgoals matched to the low-level policy's learning progress, from easy to hard, thereby improving learning efficiency. For the low-level policy, a dual policy is designed to decouple exploration from exploitation and stabilize training. Extensive experiments demonstrate that MENTOR achieves significant improvement on complex sparse-reward tasks with only a small amount of human feedback.
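To make the dynamic distance constraint idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes the allowed subgoal distance grows with an estimate of the low-level policy's success rate, so early training proposes nearby (easy) subgoals and later training proposes farther (harder) ones. The class name, distance bounds, and success-rate estimator are hypothetical assumptions for illustration only.

import numpy as np

class DynamicDistanceConstraint:
    """Sketch: restrict subgoal distance based on low-level capability."""

    def __init__(self, d_min=0.5, d_max=5.0):
        self.d_min = d_min        # smallest allowed subgoal distance (easy)
        self.d_max = d_max        # largest allowed subgoal distance (hard)
        self.success_rate = 0.0   # running estimate of subgoal-reaching success

    def update(self, reached_subgoal, momentum=0.95):
        # Exponential moving average of whether the low-level policy reached its subgoal.
        self.success_rate = momentum * self.success_rate + (1 - momentum) * float(reached_subgoal)

    def max_distance(self):
        # Interpolate between easy and hard distances as capability improves.
        return self.d_min + (self.d_max - self.d_min) * self.success_rate

    def filter(self, state, candidate_subgoals):
        # Keep only candidate subgoals within the current allowed distance of the state;
        # if none qualify, fall back to the nearest candidate.
        d = np.linalg.norm(candidate_subgoals - state, axis=-1)
        mask = d <= self.max_distance()
        return candidate_subgoals[mask] if mask.any() else candidate_subgoals[[d.argmin()]]

# Usage: the high-level policy would sample from the filtered candidate set,
# and the constraint is updated after each low-level rollout.
ddc = DynamicDistanceConstraint()
state = np.zeros(2)
candidates = np.random.uniform(-5.0, 5.0, size=(32, 2))
allowed = ddc.filter(state, candidates)
subgoal = allowed[np.random.randint(len(allowed))]
ddc.update(reached_subgoal=True)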
Pages: 1292-1306
Number of Pages: 15