Reward-Filtering-Based Credit Assignment for Multi-Agent Deep Reinforcement Learning

Cited by: 0
Authors
Xu C. [1 ,2 ]
Yin N. [1 ]
Duan S.-H. [1 ,2 ]
He H. [1 ]
Wang R. [1 ,2 ]
Affiliations
[1] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing
[2] Shunde Graduate School, University of Science and Technology Beijing, Foshan
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2022 / Vol. 45 / No. 11
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Cooperative navigation; Credit assignment; Deep reinforcement learning; Multi-agent system; Reward filtering;
DOI
10.11897/SP.J.1016.2022.02306
Abstract
In recent decades, reinforcement learning has achieved remarkable success in many fields such as intelligent traffic control, competitive gaming, unmanned system positioning, and navigation. As more and more realistic scenarios require multiple agents to undertake complex tasks cooperatively, researchers have paid increasing attention to multi-agent rather than single-agent reinforcement learning. Within multi-agent reinforcement learning (MARL), learning to cooperate has become a new research hotspot: agents must learn to cooperate using only their actions and local observations. However, the credit assignment problem must be solved when studying the cooperative behaviour of a multi-agent system with deep reinforcement learning (DRL). While agents learn to complete a task, the partially observable environment provides a reward signal for the joint action they produce, and this signal is used to update the parameters of the deep reinforcement learning network. But the global reward is non-Markovian: when an agent takes an action in the current state, the actual reward signal for that action is usually delivered only after several time steps, and this delay is even more pronounced in difficult multi-agent environments. In addition, all agents share the same global reward, which makes it hard to determine how much each individual agent contributes to the whole system. When one agent learns a good strategy early and gains a high return, the others may stop exploring, which traps the whole system in a local optimum. To solve these problems, this paper introduces a credit assignment algorithm based on reward filtering that is not restricted by the action space. The goal is to recover the local reward of each agent from the global reward obtained by all agents and to use it for training the action-value network. The exploration behaviour of other agents is a major cause of environmental non-stationarity, so an agent's own reward signal can be obtained by removing the influence of these non-stationary factors from the global reward. Based on this observation, we start from the global reward, model the influence of the non-stationary factors as noise, and propose a reward-filter-based estimation mechanism to update the value function. During centralized training, the influence of the other agents on the environment is modelled as noise, and the local reward of each agent is obtained by filtering the global reward; this local reward is used to coordinate the behaviours of the agents and to improve the system reward. We further propose a multi-agent deep reinforcement learning framework based on reward filtering (RF-MADRL) and validate it in cooperative and competitive environments, namely the cooperative navigation with obstacles and predator-prey environments of OpenAI. The experimental results show that, compared with baseline methods including the traditional MADDPG method, value-function-based algorithms (i.e., VDN, QMIX, and QTRAN), and actor-critic-based credit assignment algorithms (i.e., COMA, FacMADDPG, and MAAC), the proposed RF-MADRL performs better: the policy converges faster and the reward obtained by the agent system is higher. Ablation experiments show that the reward filter module effectively improves the agent system's reward and solves the credit assignment problem. © 2022, Science Press. All rights reserved.
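The abstract describes the core idea, recovering each agent's local reward by treating the other agents' influence on the shared global reward as noise to be filtered out during centralized training, but it does not give the concrete filter equations. The sketch below is a minimal illustration of that idea, assuming a scalar Kalman-style filter; the class name RewardFilter, the hyperparameters process_var and obs_var, and the way the filtered reward is plugged into a TD target are illustrative assumptions, not the paper's actual formulation.

```python
class RewardFilter:
    """Illustrative scalar Kalman-style filter for one agent.

    The shared global reward is treated as a noisy observation of this
    agent's local reward; the "observation noise" stands in for the
    contribution of the other (non-stationary) agents. Hyperparameter
    values are placeholders, not values from the paper.
    """

    def __init__(self, process_var: float = 1e-2, obs_var: float = 1.0):
        self.estimate = 0.0          # current estimate of the local reward
        self.error_var = 1.0         # variance of that estimate
        self.process_var = process_var
        self.obs_var = obs_var

    def update(self, global_reward: float) -> float:
        # Predict: assume the local reward drifts slowly between steps.
        self.error_var += self.process_var
        # Correct: move the estimate toward the observed global reward in
        # proportion to how much of it we attribute to this agent.
        gain = self.error_var / (self.error_var + self.obs_var)
        self.estimate += gain * (global_reward - self.estimate)
        self.error_var *= (1.0 - gain)
        return self.estimate


if __name__ == "__main__":
    # Each agent keeps its own filter; during centralized training the
    # filtered value would replace the raw global reward when forming that
    # agent's TD target, e.g. y_i = r_i_filtered + gamma * Q_i_target(...).
    filters = [RewardFilter() for _ in range(3)]
    for r_global in [1.0, 0.5, 2.0, 0.0]:
        local_estimates = [f.update(r_global) for f in filters]
        print(local_estimates)
```

In this sketch the per-agent filters only differ through the rewards they are fed, so a real implementation would condition the filter on each agent's own observations and actions; the point is merely to show how a filtered, agent-specific reward could replace the shared global reward in the critic update.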
Pages: 2306-2320
Page count: 14