Replica analysis of overfitting in generalized linear regression models

Cited by: 16
Authors
Coolen, A. C. C. [1, 2, 3]
Sheikh, M. [4]
Mozeika, A. [3]
Aguirre-Lopez, F. [4]
Antenucci, F. [2]
Affiliations
[1] Radboud Univ Nijmegen, Dept Biophys, NL-6525 AJ Nijmegen, Netherlands
[2] Saddle Point Sci Ltd, 35A South St, London W1K 2XF, England
[3] London Inst Math Sci, 35A South St, London W1K 2XF, England
[4] Kings Coll London, Dept Math, London WC2R 2LS, England
Funding
UK Biotechnology and Biological Sciences Research Council; UK Medical Research Council
Keywords
generalized linear models; overfitting; regression; replica method; STATISTICAL-MECHANICS; BAYESIAN-ANALYSIS; BIAS CORRECTION; MAP ESTIMATION; MOMENTS; MATRIX; CDMA; TAP;
DOI
10.1088/1751-8121/aba028
Chinese Library Classification number
O4 [Physics]
Discipline classification code
0702
Abstract
Nearly all statistical inference methods were developed for the regime where the number N of data samples is much larger than the data dimension p. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if p = O(N), due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and how with the replica method from statistical physics one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLM), with possibly correlated covariates. First, we analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of L2 priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the L2 regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by application to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime p = O(N).
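As a rough illustration of why ML inference degrades when p = O(N), the sketch below simulates the simplest GLM mentioned in the abstract, linear regression with Gaussian noise. It is not the replica calculation of the paper. It relies only on the classical result that the ML noise-variance estimate (the mean squared residual) has expectation sigma^2 (1 - p/N), so it is biased downward once p/N is not small. The function name `ml_noise_ratio`, the uncorrelated Gaussian covariates, and the chosen sample sizes are assumptions made for this example.

```python
import numpy as np


def ml_noise_ratio(n_samples, n_features, sigma=1.0, seed=0):
    """Return (ML noise-variance estimate) / (true noise variance).

    Assumed setup: uncorrelated standard-normal covariates and a
    linear model with Gaussian noise of standard deviation `sigma`.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    # True coefficients scaled so that the signal X @ beta has O(1) variance.
    beta_true = rng.standard_normal(n_features) / np.sqrt(n_features)
    y = X @ beta_true + sigma * rng.standard_normal(n_samples)

    # For Gaussian noise the ML coefficient estimate is ordinary least squares.
    beta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The ML variance estimate divides the residual sum of squares by N, not N - p.
    residuals = y - X @ beta_ml
    sigma2_ml = np.mean(residuals ** 2)
    return sigma2_ml / sigma ** 2


if __name__ == "__main__":
    for n, p in [(2000, 20), (2000, 800)]:
        ratio = ml_noise_ratio(n, p)
        print(f"p/N = {p / n:.2f}: ML/true noise variance = {ratio:.3f}, "
              f"classical prediction 1 - p/N = {1 - p / n:.2f}")
```

For p much smaller than N the ratio stays close to 1, while at p/N = 0.4 the fitted model attributes a sizeable part of the noise to the covariates. For nonlinear GLMs such as logistic or Cox regression there is no comparably simple closed form, which is the gap the order parameter equations described in the abstract are meant to fill.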
Pages: 48
Related papers
50 in total
  • [1] Replica analysis of overfitting in regression models for time-to-event data
    Coolen, A. C. C.
    Barrett, J. E.
    Paga, P.
    Perez-Vicente, C. J.
    JOURNAL OF PHYSICS A-MATHEMATICAL AND THEORETICAL, 2017, 50 (37)
  • [2] Replica analysis of overfitting in regression models for time to event data: the impact of censoring
    Massa, E.
    Mozeika, A.
    Coolen, A. C. C.
    JOURNAL OF PHYSICS A-MATHEMATICAL AND THEORETICAL, 2024, 57 (12)
  • [3] Benign overfitting in linear regression
    Bartlett, Peter L.
    Long, Philip M.
    Lugosi, Gabor
    Tsigler, Alexander
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2020, 117 (48) : 30063 - 30070
  • [4] Sibling Regression for Generalized Linear Models
    Shankar, Shiv
    Sheldon, Daniel
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT II, 2021, 12976 : 781 - 795
  • [5] Correction of overfitting bias in regression models
    Massa, Emanuele
    Jonker, Marianne A.
    Roes, Kit
    Coolen, Anthony C.C.
    arXiv, 2022
  • [6] Analysis of dental caries using generalized linear and count regression models
    Javali, S. B.
    Pandit, Parameshwar V.
    ROMANIAN STATISTICAL REVIEW, 2013, (10) : 73 - 82
  • [7] Coefficient tree regression for generalized linear models
    Surer, Ozge
    Apley, Daniel W.
    Malthouse, Edward C.
    STATISTICAL ANALYSIS AND DATA MINING, 2021, 14 (05) : 407 - 429
  • [8] General bound of overfitting for MLP regression models
    Rynkiewicz, J.
    NEUROCOMPUTING, 2012, 90 : 106 - 110
  • [9] Local linear regression for generalized linear models with missing data
    Wang, CY
    Wang, SJ
    Gutierrez, RG
    Carroll, RJ
    ANNALS OF STATISTICS, 1998, 26 (03) : 1028 - 1050
  • [10] Generalized orthogonal components regression for high dimensional generalized linear models
    Lin, Yanzhu
    Zhang, Min
    Zhang, Dabao
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2015, 88 : 119 - 127