Resolution and relevance trade-offs in deep learning

Cited by: 20
Authors
Song, Juyong [1 ,2 ,3 ]
Marsili, Matteo [3 ,4 ]
Jo, Junghyo [1 ,2 ,5 ,6 ]
Affiliations
[1] Asia Pacific Ctr Theoret Phys, Pohang 37673, Gyeongbuk, South Korea
[2] Pohang Univ Sci & Technol, Dept Phys, Pohang 37673, Gyeongbuk, South Korea
[3] Abdus Salam Int Ctr Theoret Phys, Str Costiera 11, I-34014 Trieste, Italy
[4] Ist Nazl Fis Nucl, Sez Trieste, Trieste, Italy
[5] Korea Inst Adv Study, Sch Computat Sci, Seoul 02455, South Korea
[6] Keimyung Univ, Dept Stat, Daegu 42601, South Korea
Funding
National Research Foundation of Singapore;
Keywords
deep learning; CRITICALITY; ALGORITHM;
DOI
10.1088/1742-5468/aaf10f
Chinese Library Classification (CLC)
O3 [Mechanics];
Discipline classification code
08; 0801;
Abstract
Deep learning has been successfully applied to various tasks, but its underlying mechanism remains unclear. Neural networks associate similar inputs in the visible layer to the same state of hidden variables in deep layers. The fraction of inputs that are associated to the same state is a natural measure of similarity and is simply related to the cost in bits required to represent these inputs. The degeneracy of states with the same information cost provides instead a natural measure of noise and is simply related to the entropy of the frequency of states, which we call relevance. Representations with minimal noise, at a given level of resolution, are those that maximise the relevance. A signature of such efficient representations is that frequency distributions follow power laws. We show, in extensive numerical experiments, that deep neural networks extract a hierarchy of efficient representations from data, because they (i) achieve low levels of noise (i.e. high relevance) and (ii) exhibit power-law distributions. We also find that the layer that most reliably generates patterns of the training data is the one for which relevance and resolution are traded at the same price, which implies that the frequency distribution follows Zipf's law.
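Concretely, the resolution/relevance pair described in the abstract can be estimated from the empirical frequencies of hidden states. The following is a minimal sketch, assuming (as the abstract suggests and as in the related relevance literature) that resolution is the entropy of the distribution k_s/M of inputs over hidden states and relevance is the entropy of the induced frequency-of-frequencies distribution k*m_k/M; the function name and toy data are hypothetical, not taken from the paper.

import numpy as np
from collections import Counter

def resolution_and_relevance(hidden_states):
    """Estimate resolution and relevance from a sample of hidden-layer states.

    hidden_states: sequence of hashable state labels (e.g. tuples of binary
    hidden-unit activations), one label per input in the sample.
    """
    M = len(hidden_states)

    # k_s: number of inputs mapped to each distinct hidden state s
    counts = Counter(hidden_states)
    k = np.array(list(counts.values()), dtype=float)

    # Resolution: entropy of the empirical state distribution k_s / M,
    # i.e. the average cost (in nats) of specifying an input's hidden state.
    resolution = -np.sum((k / M) * np.log(k / M))

    # Relevance: entropy of the frequency-of-frequencies distribution.
    # m_k = number of distinct states observed exactly k times; a fraction
    # k * m_k / M of the sample sits in states of frequency k.
    m = Counter(counts.values())
    km_over_M = np.array([freq * mult / M for freq, mult in m.items()])
    relevance = -np.sum(km_over_M * np.log(km_over_M))

    return resolution, relevance

# Toy example: six distinct hidden states over ten inputs
states = ["a", "a", "a", "b", "b", "c", "d", "d", "e", "f"]
print(resolution_and_relevance(states))

In this reading, states observed with the same frequency are degenerate (indistinguishable at that information cost), so a high relevance at fixed resolution corresponds to the low-noise, broadly distributed (power-law-like) frequency profiles the paper reports.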
Pages: 14