Understanding Deep Learning (Still) Requires Rethinking Generalization

Cited by: 1430
Authors
Zhang, Chiyuan [1]
Bengio, Samy [1]
Hardt, Moritz [1,2]
Recht, Benjamin [1,2]
Vinyals, Oriol [3]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
[3] DeepMind, London N1C 4AG, England
Keywords
Deep learning; Sampling; Classification (of information); Convolutional neural networks; Image classification; Gradient methods
DOI
10.1145/3446776
Chinese Library Classification (CLC) number
TP3 [Computing technology and computer technology]
Discipline classification code
0812
Abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
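What follows is a minimal sketch of the label-randomization experiment described in the abstract, assuming PyTorch is available. The data sizes, the fully connected architecture, and the hyperparameters are illustrative stand-ins, not the configurations used in the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Gaussian "images" paired with uniformly random labels: there is no signal
# relating inputs to targets, so any fit is pure memorization.
n, d, num_classes = 512, 3 * 32 * 32, 10
x = torch.randn(n, d)
y = torch.randint(0, num_classes, (n,))

# An over-parameterized fully connected network with far more weights than
# training points (a stand-in for the convolutional networks in the paper).
model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_classes),
)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

train_acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"final loss {loss.item():.4f}  train accuracy {train_acc:.3f}")
# Given enough steps, training accuracy approaches 1.0 even though the labels
# are noise, which is the memorization effect the abstract reports.

The depth-two expressivity claim can be illustrated in the same spirit. The toy NumPy construction below is a schematic rather than the paper's exact theorem statement: it interpolates n arbitrary targets with a one-hidden-layer ReLU network using roughly 2n + d parameters by projecting the points to one dimension, placing one ReLU kink between consecutive projections, and solving a triangular linear system for the output weights.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))   # n points in d dimensions
y = rng.standard_normal(n)        # arbitrary real-valued targets

a = rng.standard_normal(d)        # projection direction: d parameters
z = X @ a                         # projections are distinct with probability one
order = np.argsort(z)
z_sorted = z[order]

# One ReLU kink between consecutive projected points: n bias parameters.
b = np.empty(n)
b[0] = z_sorted[0] - 1.0
b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0

# The activation matrix is lower triangular with a positive diagonal, so the
# n output weights solving A w = y always exist.
A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y[order])

# The network c(x) = sum_j w_j * relu(a.x - b_j) fits every training point.
pred = np.maximum(z[:, None] - b[None, :], 0.0) @ w
print("max interpolation error:", np.abs(pred - y).max())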
Pages: 107-115
Number of pages: 9