A Hybrid Online Off-Policy Reinforcement Learning Agent Framework Supported by Transformers

Cited by: 2
Authors
Villarrubia-Martin, Enrique Adrian [1 ]
Rodriguez-Benitez, Luis [1 ]
Jimenez-Linares, Luis [1 ]
Munoz-Valero, David [2 ]
Liu, Jun [3 ]
Affiliations
[1] Univ Castilla La Mancha, Dept Technol & Informat Syst, Paseo Univ, Ciudad Real 413005, Spain
[2] Univ Castilla La Mancha, Dept Technol & Informat Syst, Ave Carlos III, s/n, Toledo 45004, Spain
[3] Univ Ulster, Sch Comp, Belfast, Northern Ireland
Keywords
Reinforcement learning; self-attention; off-policy; Transformer; experience replay
DOI
10.1142/S012906572350065X
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning (RL) is a powerful technique that allows agents to learn optimal decision-making policies through interactions with an environment. However, traditional RL algorithms suffer from several limitations, such as the need for large amounts of data and long-term credit assignment, i.e., the problem of determining which actions actually produce a certain reward. Recently, Transformers have shown their capacity to address these constraints in an offline setting. This paper proposes a framework that uses Transformers to enhance the training of online off-policy RL agents and to address the challenges described above through self-attention. The proposal introduces a hybrid agent with a mixed policy that combines an online off-policy agent with an offline Transformer agent based on the Decision Transformer architecture. By sequentially exchanging the experience replay buffer between the two agents, the framework improves the learning efficiency of the online agent in the first iterations and, at the same time, enables the training of Transformer-based RL agents in situations with limited data availability or unknown environments.
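The buffer-sharing idea described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the agent classes, the `swap_every` schedule, and the update rules are all hypothetical placeholders standing in for a real off-policy learner and a Decision-Transformer-style sequence model.

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared experience replay buffer exchanged between the two agents."""
    def __init__(self, capacity=1000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, k):
        return random.sample(list(self.data), min(k, len(self.data)))

class OnlineAgent:
    """Stand-in for an online off-policy learner (e.g. a DQN-style agent)."""
    def act(self, state):
        return random.choice([0, 1])  # placeholder exploratory policy

    def update(self, batch):
        return len(batch)  # placeholder for a gradient step on the batch

class OfflineTransformerAgent:
    """Stand-in for a Decision-Transformer-style agent trained on logged data."""
    def train_on(self, batch):
        return len(batch)  # placeholder for a sequence-modeling step

def hybrid_training(steps=50, swap_every=10):
    buffer = ReplayBuffer()
    online, offline = OnlineAgent(), OfflineTransformerAgent()
    state = 0
    for t in range(steps):
        # Online agent interacts with the environment and fills the buffer.
        action = online.act(state)
        reward, next_state = random.random(), state + 1
        buffer.add((state, action, reward, next_state))
        online.update(buffer.sample(8))
        # Periodically hand the shared buffer to the Transformer agent,
        # so it can learn offline from the online agent's experience.
        if (t + 1) % swap_every == 0:
            offline.train_on(buffer.sample(16))
        state = next_state
    return buffer

buf = hybrid_training()
print(len(buf.data))  # → 50 transitions collected in the shared buffer
```

The key design point this toy captures is that neither agent owns the data: the replay buffer is the exchange medium, which is what lets the offline Transformer agent train without its own environment interactions.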
Pages: 19
Related papers
50 records in total
  • [41] Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning
    Ren, Jineng
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2024, 17 (01)
  • [42] Multi-agent off-policy actor-critic algorithm for distributed multi-task reinforcement learning
    Stankovic, Milos S.
    Beko, Marko
    Ilic, Nemanja
    Stankovic, Srdjan S.
    EUROPEAN JOURNAL OF CONTROL, 2023, 74
  • [43] Off-Policy Temporal Difference Learning with Bellman Residuals
    Yang, Shangdong
    Sun, Dingyuanhao
    Chen, Xingguo
    MATHEMATICS, 2024, 12 (22)
  • [44] Online Attentive Kernel-Based Off-Policy Temporal Difference Learning
    Yang, Shangdong
    Zhang, Shuaiqiang
    Chen, Xingguo
    APPLIED SCIENCES-BASEL, 2024, 14 (23):
  • [45] Off-Policy Meta-Reinforcement Learning With Belief-Based Task Inference
    Imagawa, Takahisa
    Hiraoka, Takuya
    Tsuruoka, Yoshimasa
    IEEE ACCESS, 2022, 10 : 49494 - 49507
  • [46] Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
    Steckelmacher, Denis
    Plisnier, Helene
    Roijers, Diederik M.
    Nowe, Ann
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2019, PT III, 2020, 11908 : 19 - 34
  • [47] Off-policy correction algorithm for double Q network based on deep reinforcement learning
    Zhang, Qingbo
    Liu, Manlu
    Wang, Heng
    Qian, Weimin
    Zhang, Xinglang
    IET CYBER-SYSTEMS AND ROBOTICS, 2023, 5 (04)
  • [48] Model-Based Off-Policy Deep Reinforcement Learning With Model-Embedding
    Tan, Xiaoyu
    Qu, Chao
    Xiong, Junwu
    Zhang, James
    Qiu, Xihe
    Jin, Yaochu
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (04): : 2974 - 2986
  • [49] Relative importance sampling for off-policy actor-critic in deep reinforcement learning
    Humayoo, Mahammad
    Zheng, Gengzhong
    Dong, Xiaoqing
    Miao, Liming
    Qiu, Shuwei
    Zhou, Zexun
    Wang, Peitao
    Ullah, Zakir
    Junejo, Naveed Ur Rehman
    Cheng, Xueqi
    SCIENTIFIC REPORTS, 15 (1)
  • [50] An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning
    Meng, Wenjia
    Zheng, Qian
    Shi, Yue
    Pan, Gang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (05) : 2223 - 2235