Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms

Cited by: 0
Authors
Wang, Yazhen [1 ]
Wu, Shang [2 ]
Affiliations
[1] Univ Wisconsin, Dept Stat, Madison, WI 53706 USA
[2] Fudan Univ, Dept Stat, Sch Management, Shanghai 200433, Peoples R China
Funding
National Science Foundation (USA);
Keywords
acceleration; gradient descent; gradient flow central limit theorem; joint asymptotic analysis; joint computational and statistical analysis; Lagrangian flow central limit theorem; mini-batch; optimization; ordinary differential equation; stochastic differential equation; stochastic gradient descent; weak convergence; PARTIAL-SUM PROCESSES; STRONG APPROXIMATION; INFERENCE;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
This paper investigates the asymptotic behavior of gradient descent algorithms, in particular accelerated gradient descent and stochastic gradient descent, in the context of stochastic optimization arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be modeled computationally by continuous-time ordinary or stochastic differential equations. We establish gradient flow central limit theorems that describe the limiting dynamic behaviors of these computational algorithms and the large-sample performance of the related statistical procedures as the number of algorithm iterations and the data size both go to infinity; the gradient flow central limit theorems are governed by linear ordinary or stochastic differential equations, such as time-dependent Ornstein-Uhlenbeck processes. Our study provides a novel unified framework for a joint computational and statistical asymptotic analysis: the computational asymptotic analysis studies the dynamic behavior of the algorithms over time (the number of iterations), while the statistical asymptotic analysis investigates the large-sample behavior of the statistical procedures (such as estimators and classifiers) computed by applying the algorithms; indeed, the statistical procedures are the limits of the random sequences generated by these iterative algorithms as the number of iterations goes to infinity. The joint analysis based on the obtained gradient flow central limit theorems identifies four factors, namely the learning rate, the batch size, the gradient covariance, and the Hessian, and uses them to derive new theory regarding the local minima found by stochastic gradient descent in non-convex optimization problems.
Pages: 103
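Illustrative sketch (not part of the original record): the abstract describes mini-batch stochastic gradient descent as a discretization of a stochastic differential equation whose fluctuations around the gradient flow are controlled by the learning rate, the batch size, the gradient covariance, and the Hessian, behaving near a minimum like an Ornstein-Uhlenbeck process. The Python snippet below is a minimal sketch under assumed settings: the synthetic least-squares problem, the function names sgd and gradient_flow, and all parameter values are illustrative choices, not taken from the paper. It simulates mini-batch SGD and compares the scale of its fluctuations around the full-gradient trajectory with the sqrt(learning rate / batch size) order suggested by the diffusion approximation.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: F(x) = (1 / (2n)) * ||A x - b||^2
n, d = 2000, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def sgd(x0, lr, batch, steps):
    # Mini-batch SGD: each step uses an unbiased estimate of grad F.
    x, path = x0.copy(), [x0.copy()]
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        x = x - lr * A[idx].T @ (A[idx] @ x - b[idx]) / batch
        path.append(x.copy())
    return np.array(path)

def gradient_flow(x0, lr, steps):
    # Full-gradient descent: Euler discretization of dx/dt = -grad F(x).
    x, path = x0.copy(), [x0.copy()]
    for _ in range(steps):
        x = x - lr * A.T @ (A @ x - b) / n
        path.append(x.copy())
    return np.array(path)

x0, lr, batch, steps = np.zeros(d), 0.05, 32, 2000
sgd_path = sgd(x0, lr, batch, steps)
flow_path = gradient_flow(x0, lr, steps)

# Near the minimum SGD fluctuates around the gradient flow; the diffusion
# approximation predicts fluctuations of order sqrt(lr / batch), with a
# covariance shaped by the Hessian and the gradient covariance.
fluct = sgd_path[-500:] - flow_path[-500:]
print("empirical fluctuation std:", float(fluct.std()))
print("sqrt(lr / batch) scale   :", float(np.sqrt(lr / batch)))

With these assumed settings, the empirical fluctuation scale and sqrt(lr / batch) agree in order of magnitude, which is the qualitative behavior that the paper's gradient flow central limit theorems make precise.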