Convergence rates of training deep neural networks via alternating minimization methods

Cited by: 0
Authors
Jintao Xu
Chenglong Bao
Wenxun Xing
Affiliations
[1] Tsinghua University, Department of Mathematical Sciences
[2] Tsinghua University, Yau Mathematical Sciences Center
[3] Yanqi Lake Beijing Institute of Mathematical Sciences and Applications
Source
Optimization Letters | 2024 / Volume 18
Keywords
Deep neural networks training; Alternating minimization; Kurdyka-Łojasiewicz property; Non-monotone j-step sufficient decrease; Convergence rate
DOI
Not available
Abstract
Training deep neural networks (DNNs) is an important and challenging optimization problem in machine learning due to its non-convexity and non-separable structure. Alternating minimization (AM) approaches split the compositional structure of DNNs and have drawn great interest in the deep learning and optimization communities. In this paper, we propose a unified framework for analyzing the convergence rate of AM-type network training methods. Our analysis is based on non-monotone $j$-step sufficient decrease conditions and the Kurdyka-Łojasiewicz (KL) property, which relaxes the requirement of designing descent algorithms. We derive the detailed local convergence rate when the KL exponent $\theta$ varies in $[0, 1)$. Moreover, local R-linear convergence is discussed under a stronger $j$-step sufficient decrease condition.
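To make the quantities named in the abstract concrete, the LaTeX sketch below records one standard way such conditions are written in the KL-based convergence literature; the paper's exact definitions, constants, and assumptions may differ, so this is an illustrative summary rather than the authors' precise statements.

% Illustrative sketch only; the paper's precise conditions and constants may differ.
% A non-monotone j-step sufficient decrease condition for iterates {x^k} of an
% objective F is often stated as: there exist rho > 0 and an integer j >= 1 with
\[
  F(x^{k+j}) \le \max_{0 \le i \le j-1} F(x^{k+i})
  - \rho \, \bigl\Vert x^{k+j} - x^{k+j-1} \bigr\Vert^2
  \quad \text{for all } k .
\]
% The KL property at a critical point x^* with exponent theta in [0,1) requires,
% for all x near x^* with F(x^*) < F(x) < F(x^*) + eta,
\[
  \varphi'\bigl(F(x) - F(x^*)\bigr) \, \operatorname{dist}\bigl(0, \partial F(x)\bigr) \ge 1,
  \qquad \varphi(s) = c \, s^{1-\theta}, \ c > 0 .
\]
% The standard local rates associated with the exponent theta are:
\[
  \theta = 0:\ \text{convergence in finitely many steps}, \qquad
  \theta \in (0, \tfrac{1}{2}]:\ \Vert x^k - x^* \Vert \le C q^k \ (0 < q < 1), \qquad
  \theta \in (\tfrac{1}{2}, 1):\ \Vert x^k - x^* \Vert \le C k^{-\frac{1-\theta}{2\theta-1}} .
\]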
Pages: 909–923
Number of pages: 14