Block coordinate descent algorithm improves variable selection and estimation in error-in-variables regression

Times cited: 3
Authors
Escribe, Celia [1, 2]
Lu, Tianyuan [1, 3]
Keller-Baruch, Julyan [1, 4]
Forgetta, Vincenzo [1]
Xiao, Bowei [1, 3]
Richards, J. Brent [1, 4, 5, 6]
Bhatnagar, Sahir [5, 7]
Oualkacha, Karim [8]
Greenwood, Celia M. T. [1, 4, 5, 9]
Affiliations
[1] Jewish Gen Hosp, Lady Davis Inst Med Res, Montreal, PQ, Canada
[2] MIT, Operat Res Ctr, Cambridge, MA USA
[3] McGill Univ, Quantitat Life Sci Program, Montreal, PQ, Canada
[4] McGill Univ, Dept Human Genet, Montreal, PQ, Canada
[5] McGill Univ, Dept Epidemiol Biostatist & Occupat Hlth, Montreal, PQ, Canada
[6] Kings Coll London, Dept Twin Res & Genet Epidemiol, London, England
[7] McGill Univ, Dept Diagnost Radiol, Montreal, PQ, Canada
[8] Univ Quebec Montreal, Dept Mathemat, Montreal, PQ, Canada
[9] McGill Univ, Gerald Bronfman Dept Oncol, Montreal, PQ, Canada
Funding
Canadian Institutes of Health Research; Natural Sciences and Engineering Research Council of Canada;
Keywords
estimation accuracy; high dimension; Lasso; measurement error; variable selection; NONCONCAVE PENALIZED LIKELIHOOD; MAXIMUM-LIKELIHOOD; FRACTURE; REGULARIZATION; INFERENCE; OBESITY; FUTURE; MODELS; RISK;
DOI
10.1002/gepi.22430
Chinese Library Classification
Q3 [Genetics];
Discipline classification code
071007; 090102;
Abstract
Medical research increasingly includes high-dimensional regression modeling with a need for error-in-variables methods. The Convex Conditioned Lasso (CoCoLasso) utilizes a reformulated Lasso objective function and an error-corrected cross-validation to enable error-in-variables regression, but requires heavy computations. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high-dimensional data that are only partially corrupted by measurement error. This algorithm separately optimizes the estimation of the uncorrupted and corrupted features in an iterative manner to reduce computational cost, with a specially calibrated formulation of cross-validation error. Through simulations, we show that the BDCoCoLasso algorithm successfully copes with much larger feature sets than CoCoLasso, and as expected, outperforms the naive Lasso with enhanced estimation accuracy and consistency, as the intensity and complexity of measurement errors increase. Also, a new smoothly clipped absolute deviation penalization option is added that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank. We develop and showcase the utility of covariate-adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that by leveraging more information than the naive Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error-in-variables adjustments more accessible for high-dimensional data sets. We posit the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics-facilitated personalized medicine research.
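The block-wise idea in the abstract can be illustrated with a minimal sketch. This is not the BDCoCoLasso implementation or the R package's API: it assumes additive measurement error with a known variance `tau**2` on the corrupted columns, corrects the Gram matrix accordingly, and alternates coordinate-descent sweeps over a clean block and a corrupted block. Eigenvalue clipping stands in for the positive semidefinite projection that CoCoLasso performs, and every name here (`clip_to_psd`, `block_cd_lasso`, `tau`, `lam`) is hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Scalar soft-thresholding operator used by Lasso coordinate updates."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def clip_to_psd(S, eps=1e-8):
    """Make a symmetric matrix positive definite by clipping eigenvalues.
    Assumption: a simple stand-in for CoCoLasso's PSD projection step."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.maximum(w, eps)) @ V.T

def block_cd_lasso(S, r, blocks, lam, n_outer=500, tol=1e-10):
    """Minimize 0.5*b'Sb - r'b + lam*||b||_1 by sweeping one block at a time."""
    beta = np.zeros(len(r))
    for _ in range(n_outer):
        previous = beta.copy()
        for block in blocks:          # e.g. clean features, then corrupted ones
            for j in block:
                # Correlation of feature j with the partial residual (j excluded)
                rho = r[j] - S[j] @ beta + S[j, j] * beta[j]
                beta[j] = soft_threshold(rho, lam) / S[j, j]
        if np.max(np.abs(beta - previous)) < tol:
            break
    return beta

# Simulated data: 3 clean features and 3 features observed with additive noise.
rng = np.random.default_rng(0)
n, p_clean, p_noisy = 500, 3, 3
X = rng.normal(size=(n, p_clean + p_noisy))
beta_true = np.array([1.5, 0.0, -1.0, 2.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

tau = 0.5                              # assumed known error standard deviation
Z = X.copy()
Z[:, p_clean:] += rng.normal(scale=tau, size=(n, p_noisy))

# Error-corrected surrogates for X'X/n and X'y/n under additive error.
tau2 = np.concatenate([np.zeros(p_clean), np.full(p_noisy, tau**2)])
S = clip_to_psd(Z.T @ Z / n - np.diag(tau2))
r = Z.T @ y / n

lam = 0.05
blocks = [range(p_clean), range(p_clean, p_clean + p_noisy)]
beta_hat = block_cd_lasso(S, r, blocks, lam)
print(np.round(beta_hat, 3))
```

In this sketch the whole corrected matrix is clipped at once, so the blocks only fix the order of the coordinate updates. The abstract attributes the computational savings to optimizing the uncorrupted and corrupted features separately, which this toy version does not reproduce.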
Pages: 874-890
Number of pages: 17