Neural networks trained with SGD learn distributions of increasing complexity

Cited by: 0
Authors
Refinetti, Maria [1 ,2 ]
Ingrosso, Alessandro [3 ]
Goldt, Sebastian [4 ]
Affiliations
[1] Univ Paris Diderot, Sorbonne Univ, Univ PSL, Sorbonne Paris Cite, CNRS, Lab Phys Ecole Normale Superieure, Paris, France
[2] Ecole Fed Polytech Lausanne EPFL, IdePHICS Lab, Lausanne, Switzerland
[3] Abdus Salam Int Ctr Theoret Phys ICTP, Trieste, Italy
[4] Int Sch Adv Studies SISSA, Trieste, Italy
Source
JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT | 2025 / Vol. 2025 / Issue 02
Keywords
machine learning; ICML
DOI
10.1088/1742-5468/ad8bb8
Chinese Library Classification (CLC)
O3 [Mechanics]
Discipline classification code
08; 0801
Abstract
The uncanny ability of over-parameterised neural networks to generalise well has been explained using various 'simplicity biases'. These theories postulate that neural networks avoid overfitting by first fitting simple, linear classifiers before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, yet its role in simplicity bias is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, such as mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a single neuron trained on synthetic data. We then demonstrate DSB empirically in a range of deep convolutional networks and visual transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.
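
For illustration only (not code from the paper): a minimal numpy sketch of the kind of probe the abstract describes, building a "Gaussian clone" of a dataset that preserves only each class's mean and covariance, i.e. the lower-order statistics. The function name gaussian_clone and the ridge regularisation term are assumptions introduced here.

    import numpy as np

    def gaussian_clone(X, y, ridge=1e-6, seed=0):
        # For each class, resample inputs from a Gaussian matched to the
        # empirical class mean and covariance -- data that carries only
        # the lower-order statistics mentioned in the abstract.
        rng = np.random.default_rng(seed)
        X_flat = X.reshape(len(X), -1).astype(float)
        clone = np.empty_like(X_flat)
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            Xc = X_flat[idx]
            mu = Xc.mean(axis=0)
            # A small ridge keeps the empirical covariance usable even
            # when there are fewer samples than input dimensions.
            cov = np.cov(Xc, rowvar=False) + ridge * np.eye(Xc.shape[1])
            clone[idx] = rng.multivariate_normal(mu, cov, size=len(idx))
        return clone.reshape(X.shape)

A DSB-style check would then compare a classifier's accuracy on X and on gaussian_clone(X, y) at successive training checkpoints: if the network initially relies only on mean and covariance, the two accuracies should match early in training, with a gap opening only later.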
Pages: 28