Distributional Policy Gradient With Distributional Value Function

Cited by: 1
Authors
Liu, Qi [1 ,2 ]
Li, Yanjie [1 ,2 ]
Shi, Xiongtao [1 ,2 ]
Lin, Ke [1 ,2 ]
Liu, Yuecheng [3 ]
Lou, Yunjiang [1 ,2 ]
Affiliations
[1] Harbin Inst Technol, Guangdong Key Lab Intelligent Morphing Mech & Adap, Shenzhen 518055, Peoples R China
[2] Harbin Inst Technol, Sch Mech Engn & Automat, Shenzhen 518055, Peoples R China
[3] Huawei Noah's Ark Lab, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Adaptation models; Robots; Reinforcement learning; Proposals; Mathematical models; Learning systems; Distributional reinforcement learning (RL); policy gradient; RL; sample mechanism;
DOI
10.1109/TNNLS.2024.3386225
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this article, we propose a distributional policy-gradient method that combines distributional reinforcement learning (RL) with policy gradient. Conventional RL algorithms typically estimate only the expectation of the return for a given state-action pair. Distributional RL algorithms, in contrast, treat the return as a random variable and estimate its full distribution, which characterizes the probability of different returns arising from environmental uncertainty. The return distribution therefore carries more information than its expectation and generally leads to better policies. Although distributional RL has been studied extensively in value-based methods, very few policy-gradient methods take advantage of it. To bridge this research gap, we propose a distributional policy-gradient method that introduces a distributional value function into the policy gradient (DVDPG). Instead of the expectation estimated by conventional policy-gradient methods, we estimate the distribution of the policy gradient. We further propose two sampling mechanisms for selecting the policy-gradient value used in policy improvement. The first, distribution-probability sampling, draws the policy-gradient value according to the quantile probabilities of the return distribution; the second samples uniformly over the quantiles. With these sampling mechanisms, the proposed method increases the stochasticity of the policy gradient, which improves exploration efficiency and helps avoid local optima. In sparse-reward tasks, distribution-probability sampling outperforms uniform sampling, whereas in dense-reward tasks the two mechanisms perform similarly. Moreover, we show that the conventional policy-gradient method is a special case of the proposed method. Experimental results on various sparse-reward and dense-reward OpenAI Gym tasks demonstrate the efficiency of the proposed method, which outperforms the baselines in almost all environments.
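As a rough illustration of the two sampling mechanisms described in the abstract, the sketch below draws a single policy-gradient value from a quantile-style return distribution, either according to the quantile probabilities or uniformly. This is a minimal, hypothetical sketch: the critic interface, the function name sample_return_quantile, and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two policy-gradient value sampling mechanisms
# described in the abstract; names and shapes are illustrative assumptions.
import numpy as np

def sample_return_quantile(quantile_values, quantile_probs=None, mode="distribution"):
    """Pick one quantile of the estimated return distribution to use as the
    value in a policy-gradient update.

    quantile_values : (N,) estimated return quantiles for a state-action pair.
    quantile_probs  : (N,) probabilities attached to each quantile; required
                      for the distribution-probability-sampling mode.
    mode            : "distribution" samples a quantile with probability
                      proportional to quantile_probs (first mechanism);
                      "uniform" samples each quantile with equal probability
                      (second mechanism).
    """
    n = len(quantile_values)
    if mode == "distribution":
        probs = np.asarray(quantile_probs, dtype=float)
        probs = probs / probs.sum()      # normalize to a valid distribution
        idx = np.random.choice(n, p=probs)
    else:                                # uniform sampling over quantiles
        idx = np.random.choice(n)
    return quantile_values[idx]

# Toy usage: the sampled value would replace the scalar value estimate in a
# standard actor-critic update, injecting the stochasticity that the abstract
# argues aids exploration.
quantiles = np.array([-1.0, 0.5, 2.0, 3.5])   # toy return quantiles
probs = np.array([0.1, 0.2, 0.3, 0.4])        # toy quantile probabilities
v = sample_return_quantile(quantiles, probs, mode="distribution")
```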
Pages: 6556-6568
Number of Pages: 13