A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods

Cited by: 10
Authors
Ge, Yingfeng [1]
Li, Zhiwei [1]
Zhang, Jinxin [1]
Affiliations
[1] Sun Yat-sen University, School of Public Health, Department of Medical Statistics, Guangzhou 510080, People's Republic of China
Keywords
MULTIPLE IMPUTATION;
DOI
10.1038/s41598-023-36509-2
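Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];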
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Missing data, particularly in dichotomous variables, are a common problem in medical research. However, few studies have examined imputation methods for dichotomous data, their performance, their applicability, and the factors that may affect their performance. In designing the application scenarios, we considered different missing mechanisms, sample sizes, missing rates, correlations between variables, value distributions, and numbers of missing variables. We used data simulation to construct a variety of compound scenarios for missing dichotomous variables and validated the findings on two real-world medical datasets. We compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario, using accuracy and mean absolute error (MAE) as evaluation metrics. The results showed that missing mechanisms, value distributions, and the correlation between variables were the main factors affecting imputation performance. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and show potential for practical use. When encountering missing dichotomous data, researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods in practical applications.
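The evaluation loop the abstract describes (simulate complete dichotomous data, delete values, impute, score against the known truth) can be illustrated with a minimal sketch. This is not the authors' code: it assumes a single missing variable, a missing-completely-at-random mechanism, and only two of the eight methods (mode and a simple KNN vote), and every function name and parameter value below is hypothetical. It uses only the Python standard library.

```python
import random

def simulate(n, strength=0.8, base_rate=0.3, seed=0):
    """Draw n rows (x1, x2, y) of 0/1 values; y is correlated with x1 (assumed design)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = int(rng.random() < 0.5)
        x2 = int(rng.random() < 0.5)
        # y copies x1 with probability `strength`, else it is drawn from `base_rate`
        y = x1 if rng.random() < strength else int(rng.random() < base_rate)
        rows.append((x1, x2, y))
    return rows

def mcar_indices(n, rate, seed=1):
    """Pick row indices whose y is deleted, independently of all values (MCAR)."""
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < rate]

def impute_mode(observed, targets):
    """Fill every missing y with the most frequent observed y."""
    ones = sum(y for _, _, y in observed)
    mode = int(2 * ones > len(observed))
    return [mode for _ in targets]

def impute_knn(observed, targets, k=5):
    """Fill each missing y by majority vote of the k nearest observed rows
    (Hamming distance on x1, x2; ties in distance keep observed order)."""
    imputed = []
    for x1, x2 in targets:
        nearest = sorted(observed, key=lambda r: abs(r[0] - x1) + abs(r[1] - x2))[:k]
        votes = sum(y for _, _, y in nearest)
        imputed.append(int(2 * votes > len(nearest)))
    return imputed

def score(imputed, truth):
    """Return (accuracy, MAE) of imputed values against the deleted true values."""
    m = len(truth)
    accuracy = sum(a == b for a, b in zip(imputed, truth)) / m
    mae = sum(abs(a - b) for a, b in zip(imputed, truth)) / m
    return accuracy, mae

def run(n=500, rate=0.2):
    rows = simulate(n)
    missing = set(mcar_indices(n, rate))
    observed = [r for i, r in enumerate(rows) if i not in missing]
    held_out = [r for i, r in enumerate(rows) if i in missing]
    targets = [(x1, x2) for x1, x2, _ in held_out]
    truth = [y for _, _, y in held_out]
    return {
        "mode": score(impute_mode(observed, targets), truth),
        "knn": score(impute_knn(observed, targets), truth),
    }

if __name__ == "__main__":
    for name, (acc, mae) in run().items():
        print(f"{name}: accuracy={acc:.3f} MAE={mae:.3f}")
```

When imputed values are hard 0/1 labels, as here, MAE equals one minus accuracy, so the two metrics only diverge if a method imputes probabilities or averaged values. Varying `strength`, `base_rate`, `rate`, and `n` loosely corresponds to the correlation, value distribution, missing rate, and sample size factors listed in the abstract.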
Pages: 13