A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

Cited by: 0
Authors
Arnulf Jentzen
Adrian Riekert
Affiliations
[1] School of Data Science and Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
[2] Applied Mathematics: Institute for Analysis and Numerics, University of Münster
Source
Zeitschrift für angewandte Mathematik und Physik | 2022 / Volume 73
Keywords
Artificial intelligence; Neural networks; Stochastic gradient descent; Non-convex optimization; 68T99; 41A60; 65D15;
DOI
Not available
Abstract
In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in \mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small, and the input data used in the SGD process to train the artificial neural networks is assumed to be independent and identically distributed.
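The setting described in the abstract can be illustrated with a short script. The following is a minimal NumPy sketch, not taken from the paper: it trains a one-hidden-layer ReLU network with $d$ input neurons and $H$ hidden neurons by plain SGD to fit a constant target function on i.i.d. inputs. The dimensions, the uniform input distribution, the learning rate, and the constant target value are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, H = 3, 16          # input dimension and hidden width (assumed values)
c = 2.0               # constant target function f(x) = c (assumed value)
eta = 1e-3            # "sufficiently small" constant learning rate (assumed)

# Parameters of the one-hidden-layer ReLU network.
W = rng.normal(scale=1.0 / np.sqrt(d), size=(H, d))  # hidden-layer weights
b = np.zeros(H)                                      # hidden-layer biases
v = rng.normal(scale=1.0 / np.sqrt(H), size=H)       # output weights
a = 0.0                                              # output bias

def forward(x):
    """Realization of the network: x -> v . ReLU(W x + b) + a."""
    h = np.maximum(W @ x + b, 0.0)
    return v @ h + a, h

for step in range(50_000):
    x = rng.uniform(-1.0, 1.0, size=d)   # i.i.d. input sample (assumed distribution)
    y_hat, h = forward(x)
    err = y_hat - c                      # residual against the constant target

    # Gradients of the squared loss 0.5 * err**2 with respect to all parameters;
    # the ReLU derivative is the indicator of a positive pre-activation.
    active = (h > 0.0).astype(float)
    grad_v = err * h
    grad_a = err
    grad_W = np.outer(err * v * active, x)
    grad_b = err * v * active

    # Plain SGD update with a small constant learning rate.
    v -= eta * grad_v
    a -= eta * grad_a
    W -= eta * grad_W
    b -= eta * grad_b

    if step % 10_000 == 0:
        print(f"step {step}: squared error {err**2:.6f}")
```

In this toy run the empirical squared error typically decays toward zero, which is the qualitative behavior the paper's convergence result establishes rigorously for constant target functions under its stated assumptions.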