Block coordinate descent algorithm improves variable selection and estimation in error-in-variables regression

Cited by: 3

Authors:

Escribe, Celia ^{[1,2]}

Lu, Tianyuan ^{[1,3]}

Keller-Baruch, Julyan ^{[1,4]}

Forgetta, Vincenzo ^{[1]}

Xiao, Bowei ^{[1,3]}

Richards, J. Brent ^{[1,4,5,6]}

Bhatnagar, Sahir ^{[5,7]}

Oualkacha, Karim ^{[8]}

Greenwood, Celia M. T. ^{[1,4,5,9]}

Affiliations:

[1] Jewish Gen Hosp, Lady Davis Inst Med Res, Montreal, PQ, Canada

[2] MIT, Operat Res Ctr, Cambridge, MA USA

[3] McGill Univ, Quantitat Life Sci Program, Montreal, PQ, Canada

[4] McGill Univ, Dept Human Genet, Montreal, PQ, Canada

[5] McGill Univ, Dept Epidemiol Biostatist & Occupat Hlth, Montreal, PQ, Canada

[6] Kings Coll London, Dept Twin Res & Genet Epidemiol, London, England

[7] McGill Univ, Dept Diagnost Radiol, Montreal, PQ, Canada

[8] Univ Quebec Montreal, Dept Mathemat, Montreal, PQ, Canada

[9] McGill Univ, Gerald Bronfman Dept Oncol, Montreal, PQ, Canada

Source:

GENETIC EPIDEMIOLOGY | 2021 / Volume 45 / Issue 08

Funding:

Canadian Institutes of Health Research; Natural Sciences and Engineering Research Council of Canada

Keywords:

estimation accuracy; high dimension; Lasso; measurement error; variable selection; NONCONCAVE PENALIZED LIKELIHOOD; MAXIMUM-LIKELIHOOD; FRACTURE; REGULARIZATION; INFERENCE; OBESITY; FUTURE; MODELS; RISK;

DOI:

10.1002/gepi.22430

Chinese Library Classification:

Q3 [Genetics];

Discipline classification code:

071007 ; 090102 ;

Abstract:

Medical research increasingly includes high-dimensional regression modeling with a need for error-in-variables methods. The Convex Conditioned Lasso (CoCoLasso) utilizes a reformulated Lasso objective function and an error-corrected cross-validation to enable error-in-variables regression, but requires heavy computations. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high-dimensional data that are only partially corrupted by measurement error. This algorithm separately optimizes the estimation of the uncorrupted and corrupted features in an iterative manner to reduce computational cost, with a specially calibrated formulation of cross-validation error. Through simulations, we show that the BDCoCoLasso algorithm successfully copes with much larger feature sets than CoCoLasso, and as expected, outperforms the naive Lasso with enhanced estimation accuracy and consistency, as the intensity and complexity of measurement errors increase. Also, a new smoothly clipped absolute deviation penalization option is added that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank. We develop and showcase the utility of covariate-adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that by leveraging more information than the naive Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error-in-variables adjustments more accessible for high-dimensional data sets. We posit the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics-facilitated personalized medicine research.
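The alternating scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the BDCoCoLasso R package. It assumes additive measurement error with a known noise covariance, replaces the positive semi-definite projection with simple eigenvalue clipping (a stand-in chosen for brevity, not the projection used by CoCoLasso), uses a plain Lasso penalty, and omits the calibrated cross-validation entirely. All function and variable names (`block_descent_sketch`, `nearest_psd`, `noise_cov`, and so on) are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    # Scalar soft-thresholding operator used by the Lasso coordinate update.
    return np.sign(x) * max(abs(x) - lam, 0.0)

def nearest_psd(S, eps=1e-8):
    # Assumption: eigenvalue clipping as a simple way to obtain a positive
    # definite surrogate of the error-corrected covariance block.
    S = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, eps)) @ V.T

def cd_lasso_quadratic(Sigma, rho, lam, beta0, n_iter=200):
    # Coordinate descent for: 0.5 * b' Sigma b - rho' b + lam * ||b||_1
    beta = beta0.copy()
    for _ in range(n_iter):
        for j in range(len(beta)):
            partial = rho[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft_threshold(partial, lam) / Sigma[j, j]
    return beta

def block_descent_sketch(X1, Z2, y, noise_cov, lam, n_outer=20):
    # X1: uncorrupted features; Z2: corrupted features (true X2 plus noise).
    n = len(y)
    S11 = X1.T @ X1 / n                           # clean block, no correction
    S12 = X1.T @ Z2 / n                           # cross block
    S22 = nearest_psd(Z2.T @ Z2 / n - noise_cov)  # corrected, then conditioned
    r1 = X1.T @ y / n
    r2 = Z2.T @ y / n
    b1 = np.zeros(X1.shape[1])
    b2 = np.zeros(Z2.shape[1])
    for _ in range(n_outer):
        # Update the uncorrupted block with the corrupted block held fixed.
        b1 = cd_lasso_quadratic(S11, r1 - S12 @ b2, lam, b1)
        # Update the corrupted block with the uncorrupted block held fixed.
        b2 = cd_lasso_quadratic(S22, r2 - S12.T @ b1, lam, b2)
    return b1, b2

# Simulated usage: three clean and three noisy features, one signal in each block.
rng = np.random.default_rng(0)
n, tau = 500, 0.5
X1 = rng.normal(size=(n, 3))
X2 = rng.normal(size=(n, 3))
Z2 = X2 + tau * rng.normal(size=(n, 3))           # additive measurement error
y = X1 @ np.array([2.0, 0.0, 0.0]) + X2 @ np.array([1.5, 0.0, 0.0])
y = y + 0.1 * rng.normal(size=n)
b1, b2 = block_descent_sketch(X1, Z2, y, tau**2 * np.eye(3), lam=0.05)
print(b1, b2)
```

In this sketch only the corrupted block needs the covariance correction and conditioning step, which is the intuition behind the reduced computational cost when most features are measured without error.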

Pages: 874-890

Page count: 17
