Reinforcement learning algorithm for non-stationary environments

Cited by: 63
Authors
Padakandla, Sindhu [1]
Prabuchandran, K. J. [1]
Bhatnagar, Shalabh [1]
Affiliation
[1] Indian Inst Sci, Dept Comp Sci & Automat, Bangalore, Karnataka, India
Keywords
Markov decision processes; Reinforcement learning; Non-stationary environments; Change detection; Classifiers
DOI
10.1007/s10489-020-01758-5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios RL methods yield sub-optimal decisions. In this paper, we therefore consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward accrued when the underlying model of the environment changes over time. To achieve this, we first adapt a change-point algorithm to detect changes in the statistics of the environment, and we then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change-point method detects changes in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
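For intuition, below is a minimal sketch of the general recipe the abstract describes: run a tabular RL learner, monitor the statistics of the experience stream for a change point, and switch to a fresh model when one is flagged. This is not the authors' algorithm; the per-context Q-tables, the sliding-window mean-shift test (standing in for a proper change-point detector), and all parameter names are illustrative assumptions.

import numpy as np

# Sketch only: Q-learning with one Q-table per detected "context". A crude
# mean-shift test on recent rewards plays the role of the change detector.
class ContextQLearner:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 epsilon=0.1, window=50, threshold=3.0):
        self.n_states, self.n_actions = n_states, n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.window, self.threshold = window, threshold
        self.q_tables = [np.zeros((n_states, n_actions))]  # one table per context
        self.context = 0
        self.recent, self.baseline = [], []

    def act(self, s):
        # Epsilon-greedy action selection against the current context's Q-table.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q_tables[self.context][s]))

    def update(self, s, a, r, s_next):
        # Standard Q-learning update, then feed the reward to the detector.
        q = self.q_tables[self.context]
        q[s, a] += self.alpha * (r + self.gamma * q[s_next].max() - q[s, a])
        self._detect_change(r)

    def _detect_change(self, r):
        # Flag a change when the mean reward in a sliding window drifts more
        # than `threshold` standard deviations from the baseline window.
        self.recent.append(r)
        if len(self.recent) > self.window:
            self.recent.pop(0)
        if len(self.baseline) < self.window:
            self.baseline.append(r)
            return
        mu = np.mean(self.baseline)
        sigma = np.std(self.baseline) + 1e-8
        if abs(np.mean(self.recent) - mu) > self.threshold * sigma:
            # Change detected: start a fresh Q-table for the new environment model.
            self.q_tables.append(np.zeros((self.n_states, self.n_actions)))
            self.context = len(self.q_tables) - 1
            self.baseline, self.recent = list(self.recent), []

# Illustrative usage on a toy problem whose reward regime flips at t = 500.
agent = ContextQLearner(n_states=2, n_actions=2)
s = 0
for t in range(1000):
    a = agent.act(s)
    r = float(np.random.normal(1.0 if t < 500 else -1.0, 0.1))
    s_next = np.random.randint(2)
    agent.update(s, a, r, s_next)
    s = s_next

A real implementation would replace the mean-shift heuristic with a statistically grounded change-point test and could reuse previously learned Q-tables when an old context recurs.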
Pages: 3590-3606
Number of pages: 17