Multiagent Meta-Reinforcement Learning for Adaptive Multipath Routing Optimization

Cited by: 58
Authors
Chen, Long [1 ]
Hu, Bin [1 ]
Guan, Zhi-Hong [1 ]
Zhao, Lian [2 ]
Shen, Xuemin [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
[2] Ryerson Univ, Dept Elect Comp & Biomed Engn, Toronto, ON M5B 2K3, Canada
[3] Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada
Funding
National Natural Science Foundation of China;
Keywords
Routing; Optimization; Task analysis; Heuristic algorithms; Spread spectrum communication; Routing protocols; Training; Adaptive routing; metapolicy gradient; multiagent; reinforcement learning (RL); NETWORKS; CHALLENGES; ENERGY;
DOI
10.1109/TNNLS.2021.3070584
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
In this article, we investigate the routing problem of packet networks through multiagent reinforcement learning (RL), a very challenging topic in distributed and autonomous networked systems. Specifically, the routing problem is modeled as a networked multiagent partially observable Markov decision process (MDP). Since the MDP of a network node is affected not only by its neighboring nodes' policies but also by the network traffic demand, it becomes a multitask learning problem. Inspired by the recent success of RL and metalearning, we propose two novel model-free multiagent RL algorithms, named multiagent proximal policy optimization (MAPPO) and multiagent metaproximal policy optimization (meta-MAPPO), to optimize network performance under fixed and time-varying traffic demand, respectively. A practicable distributed implementation framework is designed based on the separability of exploration and exploitation in training MAPPO. Compared with existing routing optimization policies, our simulation results demonstrate the superior performance of the proposed algorithms.
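The MAPPO algorithm named in the abstract builds on proximal policy optimization, whose core is the clipped surrogate objective that each agent (network node) minimizes on its local observations. The sketch below shows only that standard PPO loss in plain NumPy; the function name, toy numbers, and per-agent framing are illustrative assumptions, and the paper's multiagent and meta-gradient machinery is not reproduced here.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    In a MAPPO-style setting, each node would evaluate this on the
    log-probabilities of its own routing actions; this sketch makes
    no claim about the paper's exact per-agent update.
    """
    ratio = np.exp(logp_new - logp_old)           # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (elementwise min) objective, negated into a loss.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy check: with identical old/new policies the ratio is 1 everywhere,
# so the loss reduces to the negated mean advantage.
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -0.5, 0.2])
loss = ppo_clip_loss(logp, logp, adv)
```

The clipping keeps each policy update close to the behavior policy, which is what makes the separation of exploration (rollout collection) and exploitation (surrogate minimization) mentioned in the abstract practical to distribute.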
Pages: 5374-5386
Page count: 13