Instability of Variable-selection Algorithms Used to Identify True Predictors of an Outcome in Intermediate-dimension Epidemiologic Studies

Cited by: 6
Authors
Cadiou, Solene [1]
Slama, Remy [1]
Affiliations
[1] Univ Grenoble Alpes, Inst Adv Biosci, CHU Grenoble Alpes, CNRS,Team Environm Epidemiol,IAB,Inserm, Grenoble, France
Funding
European Union Horizon 2020;
Keywords
False discovery rate; Feature selection; Least Absolute Shrinkage and Selection Operator; Machine learning; Model stability; Reproducibility; PRENATAL EXPOSURE; WIDE ASSOCIATION; EXPOSOME; REGRESSION; STABILITY; LASSO; RISK; REGULARIZATION; IDENTIFICATION; PERFORMANCE;
DOI
10.1097/EDE.0000000000001340
Chinese Library Classification number
R1 [Preventive medicine, hygiene];
Discipline classification code
1004 ; 120402 ;
Abstract
Background: Machine-learning algorithms are increasingly used in epidemiology to identify true predictors of a health outcome when many potential predictors are measured. However, these algorithms can provide different outputs when repeatedly applied to the same dataset, which can compromise research reproducibility. We aimed to illustrate that commonly used algorithms are unstable and, using the example of Least Absolute Shrinkage and Selection Operator (LASSO), that stabilization method choice is crucial.
Methods: In a simulation study, we tested the stability and performance of widely used machine-learning algorithms (LASSO, Elastic-Net, and Deletion-Substitution-Addition [DSA]). We then assessed the effectiveness of six methods to stabilize LASSO and their impact on performance. We assumed that a linear combination of factors drawn from a simulated set of 173 quantitative variables assessed in 1,301 subjects influenced to varying extents a continuous health outcome. We assessed model stability, sensitivity, and false discovery proportion.
Results: All tested algorithms were unstable. For LASSO, stabilization methods improved stability without ensuring perfect stability, a finding confirmed by application to an exposome study. Stabilization methods also affected performance. Specifically, stabilization based on hyperparameter optimization, frequently implemented in epidemiology, increased the false discovery proportion dramatically when predictors explained a low share of outcome variability. In contrast, stabilization based on stability selection procedure often decreased the false discovery proportion, while sometimes simultaneously lowering sensitivity.
Conclusions: Machine-learning methods instability should concern epidemiologists relying on them for variable selection, as stabilizing a model can impact its performance. For LASSO, stabilization methods based on stability selection procedure (rather than addressing prediction stability) should be preferred to identify true predictors.
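The three quantities named in the abstract (stability, sensitivity, false discovery proportion) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that each run of a selection algorithm returns a set of variable names. Mean pairwise Jaccard similarity is used here as one possible stability measure, since the abstract does not say which measure the study used. The helper `frequency_threshold_selection` and its default threshold of 0.8 are hypothetical: they only mimic the general idea behind stability selection, which keeps variables that are selected in a large share of repeated runs.

```python
from collections import Counter
from itertools import combinations


def sensitivity(selected, true_predictors):
    """Share of the true predictors that appear in the selected set."""
    true_predictors = set(true_predictors)
    if not true_predictors:
        raise ValueError("true_predictors must not be empty")
    return len(set(selected) & true_predictors) / len(true_predictors)


def false_discovery_proportion(selected, true_predictors):
    """Share of selected variables that are not true predictors.

    Defined as 0.0 when nothing is selected (a convention assumed here).
    """
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_predictors)) / len(selected)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of runs.

    A value of 1.0 means every run selected exactly the same variables.
    This is an assumed stability measure, not necessarily the paper's.
    """
    sets = [set(r) for r in runs]
    if len(sets) < 2:
        raise ValueError("at least two runs are needed")
    sims = []
    for a, b in combinations(sets, 2):
        union = a | b
        sims.append(1.0 if not union else len(a & b) / len(union))
    return sum(sims) / len(sims)


def frequency_threshold_selection(runs, threshold=0.8):
    """Keep variables selected in at least `threshold` of the runs.

    Hypothetical helper in the spirit of stability selection.
    """
    counts = Counter(v for r in runs for v in set(r))
    n_runs = len(runs)
    return {v for v, c in counts.items() if c / n_runs >= threshold}


# Toy data (invented for illustration): three runs on the same dataset.
true_predictors = {"x1", "x2", "x3"}
runs = [{"x1", "x2", "x7"}, {"x1", "x2"}, {"x1", "x3", "x7", "x9"}]

stability = mean_pairwise_jaccard(runs)
lenient = frequency_threshold_selection(runs, threshold=0.6)
strict = frequency_threshold_selection(runs, threshold=1.0)

print("stability:", round(stability, 3))
for name, sel in (("lenient", lenient), ("strict", strict)):
    print(name, sorted(sel),
          "sensitivity:", round(sensitivity(sel, true_predictors), 3),
          "FDP:", round(false_discovery_proportion(sel, true_predictors), 3))
```

On this toy input the stricter threshold removes the false discovery but also drops a true predictor, the same kind of trade-off the Results describe for stability selection.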
Pages: 402-411
Page count: 10