Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models

Cited by: 0

Authors:

Aminmansour F. [1]

Jafferjee T. [1]

Imani E. [1]

Talvitie E.J. [2]

Bowling M. [3]

White M. [3]

Affiliations:

[1] Dept of Computing Science, the Alberta Machine Intelligence Inst, University of Alberta

[2] Dept of Computer Science, Harvey Mudd College

[3] Dept of Computing Science & Amii, University of Alberta

Source:

Journal of Artificial Intelligence Research | 2024 / Vol. 80

Funding:

Natural Sciences and Engineering Research Council of Canada (NSERC);

DOI:

10.1613/jair.1.15155

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models—simulating forwards or backwards—and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states towards values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna amongst which three update real states toward simulated states — so potentially toward hallucinated values — and our proposed approach, which does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error. ©2024 The Authors.
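
The idea of combining a predecessor model with multi-step updates can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function names, the toy chain environment, and the exact update rule are assumptions made for illustration only. The sketch rolls a predecessor model backwards from a real state and updates each simulated predecessor toward a multi-step return that bootstraps only from the real state's value, so no simulated state is ever used as a bootstrap target.

```python
def chain_predecessor(s):
    """Hypothetical predecessor model for a 5-state chain (states 0..4).

    Returns (previous_state, action, reward) for the transition that leads
    into s, or None if s has no predecessor. Action 0 means "move right";
    entering state 4 yields reward 1, every other transition yields 0.
    """
    if s == 0:
        return None
    reward = 1.0 if s == 4 else 0.0
    return s - 1, 0, reward


def multistep_predecessor_update(Q, s_real, predecessor_model, k,
                                 alpha=0.1, gamma=0.9):
    """Roll back up to k steps from the real state s_real.

    Each simulated predecessor (s_prev, a_prev) is updated toward a
    multi-step return whose only bootstrap term is max_a Q[s_real][a].
    Q is a list of per-state lists of action values, modified in place.
    """
    ret = max(Q[s_real])  # bootstrap from the REAL state only
    s = s_real
    for _ in range(k):
        prev = predecessor_model(s)
        if prev is None:
            break
        s_prev, a_prev, r = prev
        # Extend the return one step further back in time.
        ret = r + gamma * ret
        Q[s_prev][a_prev] += alpha * (ret - Q[s_prev][a_prev])
        s = s_prev
    return Q


# Usage: 5 states, 2 actions, all values initialised to zero.
Q = [[0.0, 0.0] for _ in range(5)]
multistep_predecessor_update(Q, s_real=4, predecessor_model=chain_predecessor,
                             k=2, alpha=1.0, gamma=0.9)
```

By contrast, a one-step successor-model update would bootstrap from the value of a simulated next state, which is the situation the Hallucinated Value Hypothesis identifies as risky when the model is inaccurate.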

Pages: 441-473

Page count: 32

6 records in total

[1] Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models
Aminmansour, Farzane
Jafferjee, Taher
Imani, Ehsan
Talvitie, Erin J.
Bowling, Michael
White, Martha
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2024, 80 : 441 - 473
[2] Selective Dyna-Style Planning Under Limited Model Capacity
Abbas, Zaheer
Sokota, Samuel
Talvitie, Erin J.
White, Martha
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
[3] TADS: Learning Time-Aware Scheduling Policy with Dyna-Style Planning for Spaced Repetition
Yang, Zhengyu
Shen, Jian
Liu, Yunfei
Yang, Yang
Zhang, Weinan
Yu, Yong
PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, : 1917 - 1920
[4] Mitigating spatial hallucination in large language models for path planning via prompt engineering
Zhang, Hongjie
Deng, Hourui
Ou, Jie
Feng, Chaosheng
SCIENTIFIC REPORTS, 2025, 15 (01):
[5] Towards Mitigating Hallucination in Large Language Models via Self-Reflection
Ji, Ziwei
Yu, Tiezheng
Xu, Yan
Lee, Nayeon
Ishii, Etsuko
Fung, Pascale
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 1827 - 1843
[6] Mitigating Hallucination in Visual-Language Models via Re-balancing Contrastive Decoding
Liang, Xiaoyu
Yu, Jiayuan
Mu, Lianrui
Zhuang, Jiedong
Hu, Jiaqi
Yang, Yuchen
Ye, Jiangnan
Lu, Lu
Chen, Jian
Hu, Haoji
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 482 - 496