Reward design for multi-agent reinforcement learning with a penalty based on the payment mechanism

Cited by: 0
Authors
Matsunami N. [1 ]
Okuhara S. [2 ]
Ito T. [2 ]
Affiliations
[1] Department of Computer Science, Nagoya Institute of Technology
[2] Department of Social Informatics, Kyoto University
Funding
Japan Society for the Promotion of Science (JSPS)
Keywords
Mechanism design; Multi-agent reinforcement learning; Vickrey-Clarke-Groves mechanism
DOI
10.1527/tjsai.36-5_AG21-H
Abstract
In this paper, we propose a novel method of reward design for multi-agent reinforcement learning (MARL). One of the main uses of MARL is building cooperative policies among self-interested agents. We take inspiration from the concept of mechanism design in game theory to modify how agents are rewarded in MARL algorithms. We define a payment that reflects an agent's negative contribution to the other agents' valuations, in the same manner as the Vickrey-Clarke-Groves (VCG) mechanism. Each learning agent receives a reward signal consisting of two elements: a reward evaluated solely on the basis of its individual behavior, which on its own would induce a greedy, selfish policy, and a negative reward, i.e., a penalty based on the payment, which reflects the agent's negative contribution to social welfare. We call this scheme reward design for MARL based on the payment mechanism (RDPM). We evaluated RDPM in two different scenarios and show that it can increase the social utility among agents, whereas the other reward designs achieve far less, even on basic and simplistic problems. We finally analyze and discuss how RDPM affects the building of a cooperative policy. © 2021, Japanese Society for Artificial Intelligence. All rights reserved.
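The following Python sketch illustrates the kind of reward signal the abstract describes: an individual (selfish) reward combined with a VCG-style payment that measures the harm an agent imposes on the other agents' valuations. The function names, the per-agent valuation function, and the counterfactual "outcome without the agent" are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the RDPM reward signal, assuming hypothetical per-agent
# valuation functions and a counterfactual outcome computed with the agent
# removed; the paper's exact definitions may differ.
from typing import Callable, Hashable, List

Agent = Hashable


def vcg_style_payment(
    agent: Agent,
    agents: List[Agent],
    valuation: Callable[[Agent, object], float],
    outcome_with_agent: object,
    outcome_without_agent: object,
) -> float:
    # Payment = (others' total valuation if `agent` were absent)
    #         - (others' total valuation under the actual outcome),
    # i.e. the negative contribution the agent makes to the others' welfare,
    # in the spirit of the VCG (Clarke pivot) payment.
    others = [j for j in agents if j != agent]
    welfare_without = sum(valuation(j, outcome_without_agent) for j in others)
    welfare_with = sum(valuation(j, outcome_with_agent) for j in others)
    return welfare_without - welfare_with


def rdpm_reward(individual_reward: float, payment: float) -> float:
    # Combined learning signal: the agent's selfish reward minus the
    # payment-based penalty for its negative contribution to social welfare.
    return individual_reward - payment
```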