MENTOR: Guiding Hierarchical Reinforcement Learning With Human Feedback and Dynamic Distance Constraint

Cited by: 0
Authors
Zhou, Xinglin [1 ]
Yuan, Yifu [2 ]
Yang, Shaofu [3 ]
Hao, Jianye [2 ]
Affiliations
[1] Southeast University, Southeast University-Monash University Joint Graduate School, Suzhou 215123, China
[2] Tianjin University, College of Intelligence and Computing, Tianjin 300072, China
[3] Southeast University, School of Computer Science and Engineering, Nanjing 211189, China
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2025, Vol. 9, No. 2
Funding
National Natural Science Foundation of China;
Keywords
Training; Reinforcement learning; Trajectory; Manuals; Computational intelligence; Uncertainty; Space exploration; Optimization; Dynamic scheduling; Dynamic distance constraint; hierarchical reinforcement learning; reinforcement learning from human feedback;
DOI
10.1109/TETCI.2025.3529902
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hierarchical reinforcement learning (HRL) offers a promising approach to complex tasks with sparse rewards by using a hierarchical framework that divides a task into subgoals and completes them sequentially. However, current methods struggle to find subgoals that ensure a stable learning process. To address this issue, we propose a general hierarchical reinforcement learning framework incorporating human feedback and dynamic distance constraints, termed MENTOR, which acts as a "mentor". Specifically, human feedback is incorporated into high-level policy learning to find better subgoals. Furthermore, we propose the Dynamic Distance Constraint (DDC) mechanism, which dynamically adjusts the space of optional subgoals so that MENTOR can generate subgoals matching the low-level policy's learning progress, from easy to hard, thereby improving learning efficiency. For the low-level policy, a dual policy is designed to decouple exploration from exploitation and stabilize the training process. Extensive experiments demonstrate that MENTOR achieves significant improvements on complex tasks with sparse rewards using only a small amount of human feedback.
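The record above does not give the paper's exact formulation of the Dynamic Distance Constraint, so the following is only a minimal illustrative sketch of the general idea described in the abstract: restrict the high-level policy's subgoal choices to those near the current state, and relax the restriction as the low-level policy becomes more competent. The class name DynamicDistanceConstraint, the Euclidean goal-space distance, and the success-rate-based relaxation schedule are all assumptions made for illustration, not the authors' implementation.

import numpy as np

class DynamicDistanceConstraint:
    """Illustrative DDC-style subgoal filter (hypothetical names and schedule).

    Candidate subgoals proposed by a high-level policy are kept only if they
    lie within a distance threshold of the current state; the threshold is
    relaxed as the low-level policy's recent success rate improves, moving the
    curriculum from easy (nearby) to hard (distant) subgoals.
    """

    def __init__(self, d_min=1.0, d_max=10.0, success_target=0.7, step=0.5):
        self.d_min = d_min                      # initial (easy) threshold
        self.d_max = d_max                      # final (hard) threshold
        self.threshold = d_min                  # current distance constraint
        self.success_target = success_target    # success rate that triggers relaxation
        self.step = step                        # relaxation amount per update

    def update(self, recent_success_rate):
        """Relax the constraint once the low-level policy is competent enough."""
        if recent_success_rate >= self.success_target:
            self.threshold = min(self.threshold + self.step, self.d_max)

    def filter(self, state, candidate_subgoals):
        """Keep only subgoals within the current distance threshold of `state`."""
        candidates = np.asarray(candidate_subgoals, dtype=float)
        dists = np.linalg.norm(candidates - np.asarray(state, dtype=float), axis=1)
        mask = dists <= self.threshold
        # Fall back to the nearest candidate if none satisfy the constraint.
        return candidates[mask] if mask.any() else candidates[[np.argmin(dists)]]

# Usage sketch: early training restricts the agent to nearby subgoals,
# then the admissible subgoal space expands as competence grows.
ddc = DynamicDistanceConstraint()
state = np.zeros(2)
candidates = np.random.uniform(-10, 10, size=(16, 2))
allowed = ddc.filter(state, candidates)   # only nearby subgoals at first
ddc.update(recent_success_rate=0.8)       # low-level policy improved -> relax
allowed = ddc.filter(state, candidates)   # more distant subgoals now admissible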
Pages: 1292-1306
Page count: 15