Distributed Off-Policy Temporal Difference Learning Using Primal-Dual Method

Cited by: 2
Authors
Lee, Donghwan [1 ]
Kim, Do Wan [2 ]
Hu, Jianghai [3 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Dept Elect Engn, Daejeon 34141, South Korea
[2] Hanbat Natl Univ, Dept Elect Engn, Daejeon 34158, South Korea
[3] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47906 USA
Funding
U.S. National Science Foundation; National Research Foundation of Singapore;
Keywords
Convergence; Linear programming; Optimization; Markov processes; Symmetric matrices; Communication networks; Reinforcement learning (RL); Machine learning; Sequential analysis; Multi-agent systems; Optimal control; Distributed processing; Temporal difference (TD) learning; Primal-dual method; Algorithm; Consensus
DOI
10.1109/ACCESS.2022.3211395
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The goal of this paper is to provide theoretical analysis and additional insights into a distributed temporal-difference (TD) learning algorithm for multi-agent Markov decision processes (MDPs) from a saddle-point viewpoint. Single-agent TD-learning is a reinforcement learning (RL) algorithm for evaluating a given policy based on reward feedback. In multi-agent settings, multiple RL agents act concurrently, and each agent receives only its own local rewards. The goal of each agent is to evaluate a given policy with respect to the global reward, defined as the average of the local rewards, by sharing learning parameters over random network communications. In this paper, we propose a distributed TD-learning algorithm based on a saddle-point framework and provide a rigorous finite-time convergence analysis of the algorithm and its solution using tools from optimization theory. The results provide a general and unified perspective on the distributed policy evaluation problem and theoretically complement previous works.
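To make the distributed policy evaluation setting above concrete, here is a minimal, hypothetical sketch (not the paper's primal-dual algorithm): each agent runs TD(0) with linear function approximation on its own local reward and then mixes its parameters with neighbors over a randomly activated communication network, so that the shared estimate targets the value function of the averaged global reward. All dimensions, step sizes, feature choices, and the random-graph model below are illustrative assumptions.

```python
# Minimal illustrative sketch (NOT the paper's primal-dual algorithm):
# N agents run TD(0) with linear function approximation on their own local
# rewards, then average parameters with neighbors over a random network,
# so the shared estimate targets the value of the globally averaged reward.
import numpy as np

np.random.seed(0)
n_states, n_agents, n_features = 5, 4, 3     # illustrative sizes
gamma, alpha = 0.9, 0.05                     # discount factor, step size

P = np.random.dirichlet(np.ones(n_states), size=n_states)   # transitions under the fixed policy
Phi = np.random.randn(n_states, n_features)                  # state feature matrix
R_local = np.random.rand(n_agents, n_states)                 # each agent's local reward function

theta = np.zeros((n_agents, n_features))                     # per-agent linear parameters
s = 0
for t in range(5000):
    s_next = np.random.choice(n_states, p=P[s])
    # Local TD(0) update at every agent, using only that agent's reward.
    for i in range(n_agents):
        td_error = R_local[i, s] + gamma * Phi[s_next] @ theta[i] - Phi[s] @ theta[i]
        theta[i] += alpha * td_error * Phi[s]
    # Random communication event: with probability 1/2, average parameters
    # over a ring graph using a doubly stochastic mixing matrix.
    if np.random.rand() < 0.5:
        W = 0.5 * np.eye(n_agents) + 0.25 * (np.roll(np.eye(n_agents), 1, axis=0)
                                             + np.roll(np.eye(n_agents), -1, axis=0))
        theta = W @ theta
    s = s_next

print("consensus value estimates:", Phi @ theta.mean(axis=0))
```

The mixing matrix W above is doubly stochastic so that the parameter average is preserved across communication rounds. The paper itself instead casts the problem as a saddle-point (primal-dual) optimization, which is what enables the rigorous finite-time convergence analysis summarized in the abstract.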
Pages: 107077-107094
Number of pages: 18