Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models

Times cited: 0
Authors
Aminmansour F. [1 ]
Jafferjee T. [1 ]
Imani E. [1 ]
Talvitie E.J. [2 ]
Bowling M. [3 ]
White M. [3 ]
Affiliations
[1] Dept of Computing Science, the Alberta Machine Intelligence Inst, University of Alberta
[2] Dept of Computer Science, Harvey Mudd College
[3] Dept of Computing Science & Amii, University of Alberta
Source
Journal of Artificial Intelligence Research | 2024 / Vol. 80
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
DOI
10.1613/jair.1.15155
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, accurate models of environment dynamics are often difficult to learn, and even small model errors can cause Dyna agents to fail. In this paper, we highlight that one potential cause of such failure is bootstrapping off the values of simulated states, and we introduce a new Dyna algorithm that avoids it. We discuss a design space of Dyna algorithms defined by two choices: using successor or predecessor models (simulating forwards or backwards), and using one-step or multi-step updates. Three of the four resulting variants have been explored, but, surprisingly, the fourth has not: using predecessor models with multi-step updates. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states toward the values of simulated states can produce misleading action values that adversely affect the control policy. We discuss and evaluate all four Dyna variants. Three of them update real states toward simulated states, and therefore potentially toward hallucinated values, whereas our proposed approach does not. The experimental results provide evidence for the HVH and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error. ©2024 The Authors.
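The contrast at the heart of the abstract can be illustrated with a minimal tabular sketch. This is not the authors' algorithm. The function names, the tabular Q-table (assumed to be a `defaultdict(float)`), and the assumption that some predecessor model has already produced a backward trajectory of `(state, action, reward)` triples ending at a real state are all hypothetical choices made for illustration. In the successor-style update, a real state's value moves toward the value of a simulated next state. In the multi-step predecessor-style update, each simulated pair moves toward its discounted simulated rewards plus the discounted value of the real state, so no real value is updated toward a simulated one.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical tabular types, for illustration only.
State = int
Action = int
QTable = Dict[Tuple[State, Action], float]


def max_q(q: QTable, s: State, actions: List[Action]) -> float:
    """Greedy state value max_a Q(s, a)."""
    return max(q[(s, a)] for a in actions)


def successor_one_step_update(q: QTable, s_real: State, a: Action, r_sim: float,
                              s_sim: State, actions: List[Action],
                              gamma: float = 0.9, alpha: float = 0.5) -> None:
    """Forward (successor-model) one-step update: the REAL pair (s_real, a)
    bootstraps off the value of a SIMULATED next state s_sim, so an erroneous
    (hallucinated) value at s_sim leaks into the real state's estimate."""
    target = r_sim + gamma * max_q(q, s_sim, actions)
    q[(s_real, a)] += alpha * (target - q[(s_real, a)])


def predecessor_multistep_update(q: QTable, s_real: State,
                                 backward_traj: List[Tuple[State, Action, float]],
                                 actions: List[Action],
                                 gamma: float = 0.9, alpha: float = 0.5) -> None:
    """Backward (predecessor-model) multi-step update.

    backward_traj holds simulated (s_prev, a_prev, r) transitions, ordered from
    the one that leads directly into s_real outward (assumed to come from some
    predecessor model). Each simulated pair is updated toward its discounted
    simulated rewards plus the discounted value of the REAL state s_real; no
    target bootstraps off a simulated state's value."""
    ret = max_q(q, s_real, actions)  # bootstrap only from the real state
    for s_prev, a_prev, r in backward_traj:
        ret = r + gamma * ret  # multi-step target for s_prev, bootstrapping off s_real
        q[(s_prev, a_prev)] += alpha * (ret - q[(s_prev, a_prev)])


if __name__ == "__main__":
    q: QTable = defaultdict(float)
    q[(10, 0)] = 1.0  # value of a real state observed in the environment
    predecessor_multistep_update(q, 10, [(5, 0, 0.0), (4, 1, 1.0)], [0, 1],
                                 gamma=0.9, alpha=1.0)
    # The simulated pairs moved toward targets built from the real state's value.
    print(q[(5, 0)], q[(4, 1)])
```

In this toy example, the real state's entry `q[(10, 0)]` is read but never written by the predecessor update. By contrast, a single call to `successor_one_step_update` writes a target derived from the simulated state `s_sim` directly into a real state's entry.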
Pages: 441-473
Number of pages: 32
Related papers
6 in total
  • [1] Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models
    Aminmansour, Farzane
    Jafferjee, Taher
    Imani, Ehsan
    Talvitie, Erin J.
    Bowling, Michael
    White, Martha
    JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2024, 80 : 441 - 473
  • [2] Selective Dyna-Style Planning Under Limited Model Capacity
    Abbas, Zaheer
    Sokota, Samuel
    Talvitie, Erin J.
    White, Martha
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
  • [3] TADS: Learning Time-Aware Scheduling Policy with Dyna-Style Planning for Spaced Repetition
    Yang, Zhengyu
    Shen, Jian
    Liu, Yunfei
    Yang, Yang
    Zhang, Weinan
    Yu, Yong
    PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, : 1917 - 1920
  • [4] Mitigating spatial hallucination in large language models for path planning via prompt engineering
    Zhang, Hongjie
    Deng, Hourui
    Ou, Jie
    Feng, Chaosheng
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [5] Towards Mitigating Hallucination in Large Language Models via Self-Reflection
    Ji, Ziwei
    Yu, Tiezheng
    Xu, Yan
    Lee, Nayeon
    Ishii, Etsuko
    Fung, Pascale
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 1827 - 1843
  • [6] Mitigating Hallucination in Visual-Language Models via Re-balancing Contrastive Decoding
    Liang, Xiaoyu
    Yu, Jiayuan
    Mu, Lianrui
    Zhuang, Jiedong
    Hu, Jiaqi
    Yang, Yuchen
    Ye, Jiangnan
    Lu, Lu
    Chen, Jian
    Hu, Haoji
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 482 - 496