A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications

Cited by: 1
Authors
Hu, Ya-Han [1]
Wu, Ruei-Yan [1]
Lin, Yen-Cheng [1]
Lin, Ting-Yin [2]
Affiliations
[1] Natl Cent Univ, Dept Informat Management, Taoyuan City, Taiwan
[2] Chia Yi Christian Hosp, Ditmanson Med Fdn, Dept Lab Med, Chiayi City, Taiwan
Keywords
Missing value imputation; MissForest; Recursive feature elimination; Feature selection; Medical datasets; FEATURE-SELECTION; INCOMPLETE DATA; MULTIPLE IMPUTATION; CANCER CLASSIFICATION; CHAINED EQUATIONS; GENE SELECTION; RFE
DOI
10.1186/s12874-024-02392-2
Chinese Library Classification
R19 [Health care organizations and services (health services administration)]
Abstract
Background: Missing values in datasets present significant challenges for data analysis, particularly in the medical field, where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored.
Methods: This study introduces a novel imputation method, "recursive feature elimination-MissForest" (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Missing data rates ranging from 10% to 50% are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired-samples t-tests are employed to analyze the statistical significance of differences among the outcomes.
Results: The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to the four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates with the missing data rate.
Conclusion: This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.
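The evaluation protocol described in the abstract (MCAR masking, then scoring imputed values with NRMSE and PFC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `mcar_mask`, `nrmse`, and `pfc` are hypothetical, the NRMSE is assumed to be normalized by the variance of the true values at the masked positions, the PFC is assumed to be computed as the share of masked categorical entries that were imputed incorrectly, and simple mean imputation stands in for the methods compared in the paper.

```python
import numpy as np


def mcar_mask(shape, rate, rng):
    """Return a boolean mask (True = value removed).

    Under MCAR every cell is dropped independently with the same probability.
    """
    return rng.random(shape) < rate


def nrmse(x_true, x_imp, mask):
    """Assumed NRMSE: RMSE over masked cells, normalized by the variance
    of the true values at those cells."""
    diff = x_true[mask] - x_imp[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))


def pfc(c_true, c_imp, mask):
    """Assumed PFC: fraction of masked categorical cells imputed incorrectly."""
    return float(np.mean(c_true[mask] != c_imp[mask]))


# Toy numerical data (synthetic, for illustration only).
rng = np.random.default_rng(0)
x_true = rng.normal(size=(200, 4))

# Remove 30% of the cells completely at random.
mask = mcar_mask(x_true.shape, 0.3, rng)
x_obs = np.where(mask, np.nan, x_true)

# Baseline: fill each missing cell with its column mean over observed cells.
col_means = np.nanmean(x_obs, axis=0)
x_imp = np.where(mask, col_means, x_obs)

print("NRMSE of mean imputation:", nrmse(x_true, x_imp, mask))
```

Lower values are better for both metrics, and a perfect imputation scores 0 on each. Scoring only the masked cells keeps the observed values, which every method reproduces exactly, from diluting the comparison.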
Pages: 12