PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning

Cited by: 0
Authors
Wang, Xuesong [1 ]
Zhang, Hengrui [1 ]
Zhang, Jiazhi [1 ]
Chen, C. L. Philip [2 ]
Cheng, Yuhu [1 ]
Affiliations
[1] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Peoples R China
[2] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510006, Peoples R China
Source
IEEE Transactions on Systems, Man, and Cybernetics: Systems | 2025
Funding
National Natural Science Foundation of China;
Keywords
Decision transformer (DT); offline reinforcement learning (offline RL); pessimistic critic; sequence importance sampling;
DOI
10.1109/TSMC.2025.3583392
CLC number
TP [Automation technology; computer technology];
Discipline classification code
0812;
Abstract
Decision transformer (DT), as a conditional sequence modeling (CSM) approach, learns the action distribution for each state from historical information, such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning (offline RL). However, because DT concentrates solely on individual trajectories with high returns-to-go, it neglects the potential for constructing optimal trajectories by combining sequences of different actions; in other words, traditional DT lacks the trajectory stitching capability. To address this concern, a novel pessimistic critic decision transformer (PCDT) for offline RL is proposed. Our approach begins by pretraining a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that deviate significantly from these behavior sequences, thereby constructing a pessimistic critic. Finally, Q-values are integrated into the policy update, enabling the learned policy to approximate the behavior policy while favoring actions with the highest Q-values. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound, while the value-optimality analysis ensures that PCDT can learn the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of both actor-critic (AC) and CSM methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks. Our code is available at https://github.com/Henry0132/PCDT.
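The three-step procedure described in the abstract (pretrain a behavior DT, build a pessimistic critic via sequence importance sampling, and bias the policy update toward high Q-values) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption drawn only from the abstract, not the authors' released implementation: the MLP `behavior_dt` stands in for the pretrained decision transformer, and `pcdt_step`, the exponential deviation weight, and the hyperparameters `beta` and `eta` are illustrative choices (see the linked repository for the actual code).

```python
# Minimal sketch of a PCDT-style update, assuming the abstract's description.
# Module architectures, shapes, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, hidden = 17, 6, 256

# (1) Behavior model: stand-in for the pretrained DT that captures behavior
# sequences (a full GPT-style decision transformer is assumed in the paper).
behavior_dt = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, action_dim))

# Learned policy and Q-critic.
policy = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))

opt_q = torch.optim.Adam(critic.parameters(), lr=3e-4)
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)


def pcdt_step(states, actions, rewards, next_states, gamma=0.99, beta=1.0, eta=1.0):
    """One illustrative offline update: pessimistic critic + Q-guided policy loss."""
    # (2) Sequence-importance-style weight: policy actions that deviate strongly
    # from the behavior prediction get a small weight, shrinking the bootstrapped
    # target toward a pessimistic lower bound (terminal masking omitted for brevity).
    with torch.no_grad():
        next_actions = policy(next_states)
        behavior_next = behavior_dt(next_states)
        deviation = ((next_actions - behavior_next) ** 2).mean(dim=-1, keepdim=True)
        weight = torch.exp(-beta * deviation)  # importance-style weight in (0, 1]
        next_q = critic(torch.cat([next_states, next_actions], dim=-1))
        target_q = rewards + gamma * weight * next_q

    # Pessimistic critic update.
    q = critic(torch.cat([states, actions], dim=-1))
    q_loss = F.mse_loss(q, target_q)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # (3) Policy update: stay close to the behavior sequence (BC term) while
    # favoring actions with higher Q-values.
    pred_actions = policy(states)
    bc_loss = F.mse_loss(pred_actions, actions)
    q_term = -critic(torch.cat([states, pred_actions], dim=-1)).mean()
    pi_loss = bc_loss + eta * q_term
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    return q_loss.item(), pi_loss.item()


# Example call with a random batch of 32 transitions, purely for illustration.
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
r, s2 = torch.randn(32, 1), torch.randn(32, state_dim)
print(pcdt_step(s, a, r, s2))
```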
Pages: 12