Random Orthogonal Additive Filters: A Solution to the Vanishing/Exploding Gradient of Deep Neural Networks

Cited by: 1
Author(s)
Ceni, Andrea [1 ]
Affiliation(s)
[1] Univ Pisa, Dept Comp Sci, I-56127 Pisa, Italy
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Artificial neural networks; Training; Vectors; Computer architecture; Computational modeling; Additives; Neurons; Mathematical models; Jacobian matrices; Filters; Deep learning; exploding gradient; machine learning; recurrent neural networks (RNNs); vanishing gradient;
DOI
10.1109/TNNLS.2025.3538924
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Since the recognition in the early 1990s of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant effort has been devoted to overcoming this obstacle. However, a clear solution to the V/E issue has so far remained elusive. The pursuit of approximate dynamical isometry, i.e., parameter configurations where the singular values of the input-output Jacobian (IOJ) are tightly distributed around 1, leads to the derivation of an NN architecture that shares common traits with the popular residual network (ResNet) model. Instead of skip connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realizing a convex combination between them. Remarkably, analytical bounds demonstrate that the gradient updates can neither vanish nor explode, and these bounds hold even in the infinite-depth case. The effectiveness of this method is empirically demonstrated by training, via backpropagation, an extremely deep multilayer perceptron (MLP) of 50k layers, and an Elman NN that learns long-term dependencies reaching 10k time steps into the past. Compared with other architectures specifically devised to deal with the V/E problem, e.g., LSTMs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla recurrent NN (RNN) can be enhanced to reach state-of-the-art performance while converging remarkably fast; for instance, on the psMNIST task, it is possible to obtain test accuracy above 94% in the first epoch and above 98% after just ten epochs.
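To make the layer update described in the abstract concrete, below is a minimal sketch of one plausible reading of the additive-filter mechanism, namely h_{l+1} = (1 - alpha) * Q h_l + alpha * phi(W_l h_l + b_l), with Q a random orthogonal matrix and alpha in (0, 1) setting the convex combination. The mixing coefficient alpha, the tanh nonlinearity, and the names random_orthogonal and additive_filter_layer are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (an assumption, not the authors' code) of the additive-filter
# update suggested by the abstract: the previous activations pass through a
# random orthogonal filter and are convexly combined with the next layer's
# nonlinear activations.
import numpy as np

def random_orthogonal(n, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform (Haar)

def additive_filter_layer(h_prev, W, b, Q, alpha=0.1):
    # Convex combination of the orthogonally filtered previous activations
    # and the new nonlinear activations (tanh assumed here).
    return (1.0 - alpha) * (Q @ h_prev) + alpha * np.tanh(W @ h_prev + b)

# Forward pass through a very deep stack of such layers.
rng = np.random.default_rng(0)
n, depth = 64, 2000
h = rng.standard_normal(n)
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    Q = random_orthogonal(n, rng)
    h = additive_filter_layer(h, W, np.zeros(n), Q)
print(np.linalg.norm(h))  # stays of moderate magnitude even at this depth

In this reading, the orthogonal branch preserves the norm of the carried signal while the bounded nonlinearity contributes only a small correction at each layer, which is consistent with the intuition behind the dynamical-isometry bounds mentioned in the abstract.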
Pages: 10794-10807
Number of pages: 14