Compositional Reinforcement Learning for Discrete-Time Stochastic Control Systems

Cited by: 0
Authors
Lavaei, Abolfazl [1 ]
Perez, Mateo [2 ]
Kazemi, Milad [3 ]
Somenzi, Fabio [4 ]
Soudjani, Sadegh [1 ]
Trivedi, Ashutosh
Zamani, Majid
Affiliations
[1] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE4 5TG, England
[2] Univ Colorado Boulder, Dept Comp Sci, Boulder, CO 80309 USA
[3] Kings Coll London, Dept Informat, London WC2R 2LS, England
[4] Univ Colorado Boulder, Dept Elect Comp & Energy Engn, Boulder, CO 80309 USA
Source
IEEE OPEN JOURNAL OF CONTROL SYSTEMS | 2023, Vol. 2
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Control systems; Games; Stochastic systems; Reinforcement learning; Convergence; Computational modeling; Optimization; Compositional controller synthesis; minimax-Q learning; reinforcement learning; stochastic control systems; FINITE MDPS; CONSTRUCTION; VERIFICATION; ABSTRACTION;
DOI
10.1109/OJCSYS.2023.3329394
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812;
Abstract
We propose a compositional approach to synthesize policies for networks of continuous-space stochastic control systems with unknown dynamics using model-free reinforcement learning (RL). The approach is based on implicitly abstracting each subsystem in the network with a finite Markov decision process with unknown transition probabilities, synthesizing a strategy for each abstract model in an assume-guarantee fashion using RL, and then mapping the results back over the original network with approximate optimality guarantees. We provide lower bounds on the satisfaction probability of the overall network based on those over individual subsystems. A key contribution is to leverage the convergence results for adversarial RL (minimax Q-learning) on finite stochastic arenas to provide control strategies maximizing the probability of satisfaction over the network of continuous-space systems. We consider finite-horizon properties expressed in the syntactically co-safe fragment of linear temporal logic. These properties can readily be converted into automata-based reward functions, providing scalar reward signals suitable for RL. Since such reward functions are often sparse, we supply a potential-based reward shaping technique to accelerate learning by producing dense rewards. The effectiveness of the proposed approaches is demonstrated via two physical benchmarks including regulation of a room temperature network and control of a road traffic network.
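The abstract refers to minimax Q-learning on finite stochastic arenas and to potential-based reward shaping for sparse automaton-acceptance rewards. The sketch below is only an illustration of those two generic ingredients on a hypothetical toy stochastic game; it is not the paper's implementation, and the state/action sizes, transition kernel, potential function, and hyperparameters are all assumed placeholders.

```python
# Illustrative sketch (assumptions throughout): tabular minimax Q-learning with
# potential-based reward shaping on a toy finite stochastic game.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_ctrl, n_adv = 5, 3, 2            # toy sizes (hypothetical)
gamma, alpha, episodes, horizon = 0.95, 0.1, 2000, 20

# Random toy transition kernel P[s, a, o] = distribution over next states.
P = rng.random((n_states, n_ctrl, n_adv, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Sparse base reward: 1 only when the accepting state (here: the last state) is hit.
def base_reward(s_next):
    return 1.0 if s_next == n_states - 1 else 0.0

# Potential function for shaping (hypothetical heuristic: negative distance to goal).
def potential(s):
    return -abs((n_states - 1) - s)

Q = np.zeros((n_states, n_ctrl, n_adv))

def minimax_value(q_s):
    # Simplification: pure-strategy max-min value.  Minimax-Q proper computes the
    # mixed-strategy game value with a small linear program instead.
    return np.max(np.min(q_s, axis=1))

for _ in range(episodes):
    s = 0
    for _ in range(horizon):
        a = rng.integers(n_ctrl)             # exploration policy (epsilon-greedy omitted)
        o = rng.integers(n_adv)              # adversary action, played arbitrarily here
        s_next = rng.choice(n_states, p=P[s, a, o])
        # Potential-based shaping densifies the sparse acceptance reward while
        # leaving the optimal strategy unchanged.
        r = base_reward(s_next) + gamma * potential(s_next) - potential(s)
        target = r + gamma * minimax_value(Q[s_next])
        Q[s, a, o] += alpha * (target - Q[s, a, o])
        s = s_next

# Greedy (pure) control strategy extracted from the learned Q-table.
policy = np.argmax(np.min(Q, axis=2), axis=1)
print("learned strategy per state:", policy)
```

In the paper's compositional setting such learning is carried out per subsystem against an adversarial abstraction of the neighbors, and the resulting per-subsystem guarantees are combined into a lower bound for the whole network; the sketch above shows only the learning rule for a single arena.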
Pages: 425-438
Number of pages: 14