Understanding Deep Learning (Still) Requires Rethinking Generalization

Cited by: 1430
Authors
Zhang, Chiyuan [1]
Bengio, Samy [1]
Hardt, Moritz [1,2]
Recht, Benjamin [1,2]
Vinyals, Oriol [3]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
[3] DeepMind, London N1C 4AG, England
Keywords
Deep learning; Sampling; Classification (of information); Convolutional neural networks; Image classification; Gradient methods
DOI
10.1145/3446776
Chinese Library Classification (CLC) number
TP3 [Computing technology and computer technology]
Discipline classification code
0812
Abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
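What follows is a minimal sketch of the label-randomization experiment described in the abstract, assuming PyTorch is available. The data sizes, the fully connected architecture, and the hyperparameters are illustrative stand-ins, not the configurations used in the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Gaussian "images" paired with uniformly random labels: there is no signal
# relating inputs to targets, so any fit is pure memorization.
n, d, num_classes = 512, 3 * 32 * 32, 10
x = torch.randn(n, d)
y = torch.randint(0, num_classes, (n,))

# An over-parameterized fully connected network with far more weights than
# training points (a stand-in for the convolutional networks in the paper).
model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_classes),
)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

train_acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"final loss {loss.item():.4f}  train accuracy {train_acc:.3f}")
# Given enough steps, training accuracy approaches 1.0 even though the labels
# are noise, which is the memorization effect the abstract reports.

The depth-two expressivity claim can be illustrated in the same spirit. The toy NumPy construction below is a schematic rather than the paper's exact theorem statement: it interpolates n arbitrary targets with a one-hidden-layer ReLU network using roughly 2n + d parameters by projecting the points to one dimension, placing one ReLU kink between consecutive projections, and solving a triangular linear system for the output weights.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))   # n points in d dimensions
y = rng.standard_normal(n)        # arbitrary real-valued targets

a = rng.standard_normal(d)        # projection direction: d parameters
z = X @ a                         # projections are distinct with probability one
order = np.argsort(z)
z_sorted = z[order]

# One ReLU kink between consecutive projected points: n bias parameters.
b = np.empty(n)
b[0] = z_sorted[0] - 1.0
b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0

# The activation matrix is lower triangular with a positive diagonal, so the
# n output weights solving A w = y always exist.
A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y[order])

# The network c(x) = sum_j w_j * relu(a.x - b_j) fits every training point.
pred = np.maximum(z[:, None] - b[None, :], 0.0) @ w
print("max interpolation error:", np.abs(pred - y).max())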
Pages: 107-115
Number of pages: 9