Clustering experience replay for the effective exploitation in reinforcement learning

Times Cited: 20
Authors
Li, Min [1 ]
Huang, Tianyi [1 ]
Zhu, William [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China
Keywords
Reinforcement learning; Clustering; Experience replay; Exploitation efficiency; Time division;
DOI
10.1016/j.patcog.2022.108875
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Reinforcement learning is a useful tool for training an agent to achieve a desired goal in sequential decision-making problems. It trains the agent to make decisions by exploiting the experience contained in the transitions resulting from different decisions. To exploit this experience, most reinforcement learning methods replay the explored transitions by uniform sampling, but this easily overlooks the most recently explored transitions. Another approach defines a priority for each transition from its estimation error during training and then replays transitions according to these priorities; however, it only updates the priorities of the transitions replayed at the current training step, so transitions with low priorities tend to be ignored. In this paper, we propose a clustering experience replay, called CER, to effectively exploit the experience hidden in all explored transitions during training. CER clusters and replays transitions with a divide-and-conquer framework based on time division, as follows. First, it divides the whole training process into several periods. Second, at the end of each period, it uses k-means to cluster the transitions explored in that period. Finally, it constructs a conditional probability density function to ensure that all kinds of transitions are sufficiently replayed in the current training. We construct a new method, TD3_CER, to implement our clustering experience replay on TD3. Through theoretical analysis and experiments, we show that TD3_CER is more effective than existing reinforcement learning methods. The source code can be downloaded from https://github.com/grcai/CER-Master. (C) 2022 Elsevier Ltd. All rights reserved.
Pages: 9
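The abstract above outlines CER's divide-and-conquer replay scheme: training is split into periods, the transitions explored in each period are clustered with k-means, and a conditional probability density function governs which transitions are replayed. The following is a minimal Python sketch of that idea, not the authors' implementation: the ClusteredReplayBuffer class, the choice of concatenated state-action features, the cluster count, and the uniform cluster-then-member sampling are all assumptions standing in for the paper's actual density function and period schedule.

import numpy as np
from sklearn.cluster import KMeans

class ClusteredReplayBuffer:
    """Sketch of clustering-based experience replay (assumptions noted in comments)."""

    def __init__(self, n_clusters=8, period=10_000, batch_size=256, seed=0):
        self.transitions = []      # (state, action, reward, next_state, done) tuples
        self.labels = None         # cluster label per transition stored at the last clustering
        self.n_clusters = n_clusters
        self.period = period       # re-cluster after this many newly added transitions
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))
        # At the end of each period, cluster all transitions stored so far.
        if len(self.transitions) % self.period == 0:
            self._recluster()

    def _recluster(self):
        # Cluster on concatenated (state, action) features -- an assumption; any
        # transition representation could be used here.
        feats = np.array([np.concatenate([np.ravel(s), np.ravel(a)])
                          for s, a, *_ in self.transitions])
        self.labels = KMeans(n_clusters=self.n_clusters, n_init=10).fit(feats).labels_

    def sample(self):
        # Before the first clustering pass, fall back to plain uniform replay.
        if self.labels is None:
            idx = self.rng.integers(len(self.transitions), size=self.batch_size)
            return [self.transitions[i] for i in idx]
        # Draw a cluster uniformly, then a member uniformly, so every kind of
        # transition keeps a chance of being replayed (a stand-in for the paper's
        # conditional probability density function).
        batch = []
        for _ in range(self.batch_size):
            members = np.flatnonzero(self.labels == self.rng.integers(self.n_clusters))
            if members.size == 0:                  # empty cluster: fall back to all labelled
                members = np.arange(len(self.labels))
            batch.append(self.transitions[self.rng.choice(members)])
        return batch

In a TD3-style training loop, such a buffer would take the place of the uniform replay buffer: the agent calls add after every environment step and sample before every gradient update; transitions added after the most recent clustering pass simply wait until the next period boundary to become eligible for cluster-aware sampling.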