Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little's MCAR test)

Cited by: 3
Authors
Hayes, Timothy [1]
Baraldi, Amanda N. [2]
Coxe, Stefany [3]
Affiliations
[1] Florida Int Univ, Dept Psychol, 11200 SW 8 St, DM 381B, Miami, FL 33199 USA
[2] Oklahoma State Univ, Dept Psychol, Stillwater, OK USA
[3] Cedars Sinai Med Ctr, Dept Computat Biomed, Los Angeles, CA USA
Keywords
Missing data; Auxiliary variables; Random forest; Missing at random; R package; Selection; Inference
DOI
10.3758/s13428-024-02494-1
Chinese Library Classification (CLC) number
B841 [Research methods in psychology]
Discipline classification code
040201
Abstract
The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.
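The lasso idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the simulated data, the squared-term expansion, the penalty value of 0.05, and every function and variable name below are assumptions made for illustration. The sketch fits an L1-penalized logistic regression that predicts a missingness indicator from candidate variables, using plain proximal gradient descent in standard-library Python. Missingness is generated to depend on `x1` only through `x1**2`, a nonlinear MAR mechanism that a purely linear screen on `x1` would tend to miss. The random forest alternative is not shown.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Logistic function, clamped to avoid overflow in math.exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def standardize(column):
    """Center and scale a column so one penalty applies evenly to all predictors."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def lasso_logistic(rows, targets, penalty, step=0.5, n_iter=200):
    """L1-penalized logistic regression via proximal gradient descent.

    The intercept is left unpenalized. Hypothetical helper, not a library API.
    """
    n, p = len(rows), len(rows[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, target in zip(rows, targets):
            err = sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) - target
            grad0 += err
            for j in range(p):
                grad[j] += err * row[j]
        intercept -= step * grad0 / n
        beta = [soft_threshold(b - step * g / n, step * penalty)
                for b, g in zip(beta, grad)]
    return intercept, beta


# Simulated data (assumption): missingness depends on x1 squared; x2 is irrelevant.
rng = random.Random(2024)
n = 800
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
missing = [1 if rng.random() < sigmoid(-2.0 + 1.5 * a * a) else 0 for a in x1]

# Candidate predictors of missingness: main effects plus squared terms.
names = ["x1", "x2", "x1_sq", "x2_sq"]
columns = [standardize(c) for c in
           (x1, x2, [a * a for a in x1], [b * b for b in x2])]
rows = list(zip(*columns))

intercept, coefs = lasso_logistic(rows, missing, penalty=0.05)
selected = [name for name, c in zip(names, coefs) if c != 0.0]

for name, c in zip(names, coefs):
    print(f"{name}: {c:.3f}")
print("candidate auxiliary terms:", selected)
```

Predictors whose coefficients survive the penalty are the candidates to carry forward as auxiliary variables in FIML or multiple imputation. In practice the penalty would be tuned (for example by cross-validation) rather than fixed as it is here.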
Pages: 8608-8639
Page count: 32