Validation and generalizability of machine learning prediction models on attrition in longitudinal studies

Times cited: 10
Authors
Jankowsky, Kristin [1]
Schroeders, Ulrich [1]
Affiliations
[1] Univ Kassel, Kassel, Germany
Keywords
Machine learning; attrition; longitudinal studies; predictive modeling; generalizability; MISSING DATA; NONRESPONSE; PSYCHOLOGY; FRAMEWORK;
DOI
10.1177/01650254221075034
Chinese Library Classification
B844 [Developmental psychology (human psychology)];
Discipline classification code
040202;
Abstract
Attrition in longitudinal studies is a major threat to the representativeness of the data and the generalizability of the findings. Typical approaches to address systematic nonresponse are either expensive and unsatisfactory (e.g., oversampling) or rely on the unrealistic assumption of data missing at random (e.g., multiple imputation). Thus, models that effectively predict who is most likely to drop out at subsequent occasions might offer the opportunity to take countermeasures (e.g., incentives). With the current study, we introduce a longitudinal model validation approach and examine whether attrition in two nationally representative longitudinal panel studies can be predicted accurately. We compare the performance of a basic logistic regression model with that of a more flexible, data-driven machine learning algorithm, gradient boosting machines. Our results show almost no difference in accuracy between the two modeling approaches, which contradicts claims of similar studies on survey attrition. Prediction models could not be generalized across surveys and were less accurate when tested at a later survey wave. We discuss the implications of these findings for survey retention and for the use of complex machine learning algorithms, and we give some recommendations for dealing with study attrition.
Pages: 169-176
Number of pages: 8
Related papers
43 in total
[11]   Greedy function approximation: A gradient boosting machine [J].
Friedman, JH .
ANNALS OF STATISTICS, 2001, 29 (05) :1189-1232
[12]   Struggles with survey weighting and regression modeling [J].
Gelman, Andrew .
STATISTICAL SCIENCE, 2007, 22 (02) :153-164
[13]  
Greenwell B., 2019, GBM GEN BOOSTED REGR
[14]   Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research [J].
Hardt, Jochen ;
Herke, Max ;
Leonhart, Rainer .
BMC MEDICAL RESEARCH METHODOLOGY, 2012, 12
[15]   Difficulty of Reaching Respondents and Nonresponse Bias: Evidence from Large Government Surveys [J].
Heffetz, Ori ;
Reeves, Daniel B. .
REVIEW OF ECONOMICS AND STATISTICS, 2019, 101 (01) :176-191
[16]  
Huinink J, 2011, Z FAMILIENFORSCH, V23, P77
[17]   Predictors of attrition in a longitudinal population-based study of aging [J].
Jacobsen, Erin ;
Ran, Xinhui ;
Liu, Anran ;
Chang, Chung-Chou H. ;
Ganguli, Mary .
INTERNATIONAL PSYCHOGERIATRICS, 2021, 33 (08) :767-778
[18]   Evidence of Inflated Prediction Performance: A Commentary on Machine Learning and Suicide Research [J].
Jacobucci, Ross ;
Littlefield, Andrew K. ;
Millner, Alexander J. ;
Kleiman, Evan M. ;
Steinley, Douglas .
CLINICAL PSYCHOLOGICAL SCIENCE, 2021, 9 (01) :129-134
[19]   Use of Missing Data Methods in Longitudinal Studies: The Persistence of Bad Practices in Developmental Psychology [J].
Jelicic, Helena ;
Phelps, Erin ;
Lerner, Richard M. .
DEVELOPMENTAL PSYCHOLOGY, 2009, 45 (04) :1195-1199
[20]  
Kern C., 2019, ARXIV190913361, DOI 10.1080/0022250X.2013.877898