Gradient dynamics in reinforcement learning

Cited by: 1
Authors
Fabbricatore, Riccardo [1]
Palyulin, Vladimir V. [1]
Affiliations
[1] Skolkovo Inst Sci & Technol, Moscow 121205, Russia
Keywords
STATISTICAL-MECHANICS; NASH EQUILIBRIA; OPTIMIZATION
DOI
10.1103/PhysRevE.106.025315
Chinese Library Classification
O35 [Fluid mechanics]; O53 [Plasma physics]
Subject classification codes
070204; 080103; 080704
Abstract
Despite the success achieved by the analysis of supervised learning algorithms in the framework of statistical mechanics, reinforcement learning has remained largely untouched by physicists. Here we move towards closing the gap by analyzing the dynamics of the policy gradient algorithm. For a convex problem, namely the k-armed bandit, we show that the learning dynamics obeys a drift-diffusion motion described by a Langevin equation, the coefficients of which can be tuned by the learning rate. We explore the striking similarity between our Langevin equation and the Kimura equation, which describes the evolution of genotypes. Furthermore, we propose a mapping between a nonconvex reinforcement learning setting describing multiple joints of a robotic arm and a disordered system, namely a p-spin glass. This mapping enables us to show how the learning rate acts as an effective temperature and thus is capable of smoothing rough landscapes, corroborating what is displayed by the drift-diffusive description and paving the way for physics-inspired algorithmic optimization based on annealing procedures in disordered systems.
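As an illustration of the k-armed bandit setting mentioned in the abstract, the following is a minimal sketch of a stochastic policy gradient (REINFORCE-style) update. The softmax parameterization, the Gaussian reward noise, and all names (`softmax`, `run_policy_gradient`, `mean_rewards`, `noise_std`) are assumptions made for this example and are not taken from the paper. Each update is a noisy step whose size scales with the learning rate, which is the kind of trajectory a drift-diffusion description would aim to capture.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D parameter vector."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def run_policy_gradient(mean_rewards, learning_rate=0.1, steps=2000,
                        noise_std=1.0, seed=0):
    """Run a REINFORCE-style update on a k-armed bandit (illustrative only).

    Returns an array of shape (steps, k) holding the policy after each update.
    """
    rng = np.random.default_rng(seed)
    k = len(mean_rewards)
    theta = np.zeros(k)                  # policy parameters, one per arm
    history = np.empty((steps, k))
    for t in range(steps):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)          # sample an arm from the current policy
        # Assumed reward model: arm mean plus Gaussian noise.
        r = mean_rewards[a] + noise_std * rng.normal()
        # Gradient of log pi(a) w.r.t. theta for a softmax policy: e_a - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta += learning_rate * r * grad_log_pi
        history[t] = softmax(theta)
    return history


if __name__ == "__main__":
    # Hypothetical arm means chosen only for demonstration.
    trajectory = run_policy_gradient(np.array([0.2, 0.5, 1.0]))
    print("final policy:", trajectory[-1])
```

Rerunning with different `learning_rate` values and plotting the rows of the returned array gives a quick visual impression of how the step size changes the spread of the trajectories.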
Pages: 7