Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies

Cited by: 17
Authors
Asiain, Erick [1 ]
Clempner, Julio B. [2 ]
Poznyak, Alexander S. [1 ]
Affiliations
[1] Ctr Res & Adv Studies, Dept Automat Control, Av IPN 2508, Mexico City 07360, DF, Mexico
[2] Natl Polytechn Inst, Sch Phys & Math, Edificio 9 UP Adolfo Lopez Mateos, Mexico City 07730, DF, Mexico
Keywords
Reinforcement learning; Architecture; Average cost; Markov chains; Optimization; Algorithm
DOI
10.1007/s00500-018-3225-7
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The architecture consists of three modules: a controller, a fast-tracked learning module, and an actor-critic. Strategies are represented by a probability distribution c_{ik}. The controller balances exploration and exploitation using the Kullback-Leibler divergence to decide whether newly proposed strategies improve on the currently employed strategy. Exploitation relies on a fast-tracked learning algorithm that follows a fixed strategy and uses a priori knowledge; this module is only required to estimate the transition matrices and utilities. Exploration employs an actor-critic architecture: the actor computes strategies using a policy gradient method, and the critic decides whether the proposed strategies are accepted. We show the convergence of the algorithms that implement the architecture. An application example from inventory management demonstrates the effectiveness of the proposed architecture.
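To make the described control loop concrete, below is a minimal Python sketch written against a toy Markov chain. All identifiers (FastTrackedLearner, ActorCritic, kl_divergence), the softmax policy-gradient update, the acceptance rule, and the KL threshold are illustrative assumptions for this sketch, not the authors' algorithm; the paper's actual estimators, acceptance criteria, and convergence guarantees are developed in the full text.

# Minimal illustrative sketch of the CEE loop described in the abstract.
# All names, the toy Markov chain, and the update rules are assumptions
# made for illustration; this is not the authors' implementation.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence D(p || q), summed over all entries.
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

class FastTrackedLearner:
    # Exploitation module: under a fixed strategy, estimate the transition
    # matrices and utilities from observed (state, action, next state, cost).
    def __init__(self, n_states, n_actions):
        self.counts = np.ones((n_states, n_actions, n_states))  # Laplace prior
        self.cost_sum = np.zeros((n_states, n_actions))
        self.visits = np.zeros((n_states, n_actions))

    def update(self, s, a, s_next, cost):
        self.counts[s, a, s_next] += 1.0
        self.cost_sum[s, a] += cost
        self.visits[s, a] += 1.0

    def transition_matrix(self):
        return self.counts / self.counts.sum(axis=2, keepdims=True)

    def utilities(self):
        return self.cost_sum / np.maximum(self.visits, 1.0)

class ActorCritic:
    # Exploration module: the actor proposes a strategy c_{ik} via a
    # policy-gradient step; the critic scores it by estimated expected cost.
    def __init__(self, n_states, n_actions, lr=0.1):
        self.theta = np.zeros((n_states, n_actions))
        self.lr = lr

    def strategy(self):
        z = np.exp(self.theta - self.theta.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)  # row-stochastic c_{ik}

    def propose(self, U_hat):
        # Exact gradient of the per-state expected cost under a softmax policy.
        c = self.strategy()
        baseline = np.sum(c * U_hat, axis=1, keepdims=True)
        self.theta -= self.lr * c * (U_hat - baseline)  # descend the cost
        return self.strategy()

    def critic_accepts(self, c_new, c_old, U_hat):
        # Accept only if the proposal lowers the estimated expected cost.
        return np.sum(c_new * U_hat) <= np.sum(c_old * U_hat)

# Controller: switch to the proposed strategy (explore) only when it differs
# enough from the current one in KL divergence AND the critic accepts it;
# otherwise keep exploiting the current strategy with the estimated model.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
U_true = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

learner = FastTrackedLearner(n_states, n_actions)
ac = ActorCritic(n_states, n_actions)
c = ac.strategy()
s, kl_threshold = 0, 1e-3  # the threshold is an assumed tuning parameter

for step in range(2000):
    a = rng.choice(n_actions, p=c[s])
    s_next = rng.choice(n_states, p=P_true[s, a])
    learner.update(s, a, s_next, U_true[s, a])
    s = s_next
    if step % 50 == 0:  # periodically let the actor propose a new strategy
        U_hat = learner.utilities()
        P_hat = learner.transition_matrix()  # available for a richer critic
        c_new = ac.propose(U_hat)
        if (kl_divergence(c_new, c) > kl_threshold
                and ac.critic_accepts(c_new, c, U_hat)):
            c = c_new  # explore: the behavior strategy changes on acceptance

print("final strategy c_{ik}:")
print(np.round(c, 3))

The accept-or-reject threshold test stands in for the critic's role in the abstract; in the paper, the critic and the KL-based controller are formal procedures with proven convergence rather than this simple comparison.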
Pages: 3591-3604
Page count: 14