Wide and deep neural networks achieve consistency for classification

Times Cited: 11
Authors
Radhakrishnan, Adityanarayanan [1 ,2 ,3 ]
Belkin, Mikhail [4 ,5 ]
Uhler, Caroline [1 ,2 ,3 ]
Affiliations
[1] MIT, Lab Informat & Decis Syst, Cambridge, MA 02142 USA
[2] MIT, Inst Data Syst & Soc, Cambridge, MA 02142 USA
[3] Broad Inst MIT, Cambridge, MA 02142 USA
[4] Univ Calif San Diego, Halicioglu Data Sci Inst, La Jolla, CA 92093 USA
[5] Univ Calif San Diego, Comp Sci & Engn, La Jolla, CA 92093 USA
Keywords
neural networks; classification; consistency; neural tangent kernel;
DOI
10.1073/pnas.2208779120
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07; 0710; 09;
Abstract
While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are consistent for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that are consistent. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and neural tangent kernels, we provide explicit activation functions that can be used to construct networks that achieve consistency. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: 1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); 2) majority vote (model predictions are given by the label of the class with the greatest representation in the training set); or 3) singular kernel classifiers (a set of classifiers containing those that achieve consistency). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
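The taxonomy described above can be made concrete with a minimal sketch of the three limiting classifiers. This is an illustrative implementation under assumed conventions (Euclidean distance, a power-law singular kernel with hypothetical exponent `a`), not the paper's construction:

```python
import numpy as np

def one_nn_predict(X_train, y_train, x):
    """1-nearest neighbor: return the label of the closest training example."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

def majority_vote_predict(y_train, x):
    """Majority vote: return the most frequent training label, ignoring x."""
    labels, counts = np.unique(y_train, return_counts=True)
    return labels[np.argmax(counts)]

def singular_kernel_predict(X_train, y_train, x, a=1.0):
    """Weighted vote with a singular kernel K(x, x') = ||x - x'||^(-a).

    The weight diverges as x approaches a training point, so the
    prediction interpolates the training labels exactly.
    """
    d = np.linalg.norm(X_train - x, axis=1)
    if np.any(d == 0):  # x coincides with a training point
        return y_train[np.argmin(d)]
    w = d ** (-a)
    labels = np.unique(y_train)
    scores = [w[y_train == c].sum() for c in labels]
    return labels[np.argmax(scores)]
```

For example, with training points `[0, 0] → 0`, `[1, 1] → 1`, `[1, 0] → 1`, a query at `[0.9, 0.9]` gets label 1 from all three rules, while a query near the origin is classified 0 by 1-NN and the singular kernel but 1 by majority vote, since class 1 dominates the training set.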
Pages: 8