Fine-tuning adaptive stochastic optimizers: determining the optimal hyperparameter ϵ via gradient magnitude histogram analysis

Cited by: 0
Authors
Gustavo Silva [1 ]
Paul Rodriguez [1 ]
Affiliations
[1] Pontificia Universidad Católica del Perú, Department of Electrical Engineering
Keywords
Hyperparameter; Fine-tuning; Stochastic optimizers; Deep neural network
DOI
10.1007/s00521-024-10302-2
Abstract
Stochastic optimizers play a crucial role in the successful training of deep neural network models. To achieve optimal model performance, designers must carefully select both model and optimizer hyperparameters, a process that is frequently demanding in computational resources and processing time. While tuning the entire set of optimizer hyperparameters for peak performance is well-established practice, the individual influence of hyperparameters often mislabeled as "low priority", such as the safeguard factor ϵ and the decay rate β, in leading adaptive stochastic optimizers like Adam remains unclear. In this manuscript, we introduce a new framework based on the empirical probability density function of the loss gradient magnitude, termed the "gradient magnitude histogram", for a thorough analysis of adaptive stochastic optimizers and the safeguard hyperparameter ϵ. This framework reveals and justifies valuable relationships and dependencies among hyperparameters in connection with optimal performance across diverse tasks such as classification, language modeling, and machine translation.
Furthermore, we propose a novel algorithm that uses gradient magnitude histograms to automatically estimate a refined and accurate search space for the optimal safeguard hyperparameter ϵ, surpassing the conventional trial-and-error methodology by establishing a worst-case search space that is two times narrower.
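The abstract describes building an empirical histogram of gradient magnitudes and deriving a narrowed search space for ϵ from it. The paper's exact estimation rule is not reproduced in this record; the minimal Python sketch below only illustrates the general idea under an assumed percentile-based rule (the function name, bin count, and percentile cutoffs are all illustrative assumptions, not the authors' algorithm):

```python
import numpy as np

def epsilon_search_space(grad_magnitudes, bins=50, lo_pct=1.0, hi_pct=50.0):
    """Sketch: propose a search interval for Adam's safeguard epsilon
    from the empirical distribution of observed gradient magnitudes.

    The percentile cutoffs (lo_pct, hi_pct) are illustrative assumptions;
    the intuition is that epsilon only matters when it is comparable to
    the magnitudes appearing in the adaptive denominator.
    """
    g = np.asarray(grad_magnitudes, dtype=float)
    g = g[g > 0]  # epsilon is only meaningful relative to nonzero magnitudes

    # Gradient magnitude histogram on a log scale (magnitudes span decades).
    counts, edges = np.histogram(np.log10(g), bins=bins)

    # Candidate epsilon range taken from low quantiles of the distribution.
    eps_lo = np.percentile(g, lo_pct)
    eps_hi = np.percentile(g, hi_pct)
    return (counts, edges), (eps_lo, eps_hi)
```

For example, feeding the function per-parameter gradient magnitudes collected over a few training iterations yields a histogram for inspection plus a bracketed interval to grid-search for ϵ, instead of scanning the conventional 10⁻⁸ to 10⁰ range blindly.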
Pages: 22223–22243
Page count: 20