Missing data imputation on biomedical data using deeply learned clustering and L2 regularized regression based on symmetric uncertainty

Cited by: 14
Authors
Nagarajan, Gayathri [1]
Babu, L. D. Dhinesh [1]
Affiliations
[1] VIT Univ, Sch Informat Technol & Engn, Chennai, Tamil Nadu, India
Keywords
Deeply learned clustering; L2 regularization; Missing data imputation; Biomedical datasets; GENETIC ALGORITHM; PREDICTION; MODEL
DOI
10.1016/j.artmed.2021.102214
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The big data era in healthcare has led to the generation of high-dimensional datasets such as genomic datasets and electronic health records. One of the critical issues in such datasets is incomplete data, which may yield misleading results if not handled properly. Imputation is considered an effective way to handle missing data when the missing data rate is high. While imputation accuracy and classification accuracy are the two metrics that most imputation techniques consider, high-dimensional datasets such as genomic datasets motivate the need for imputation techniques that are also computationally efficient and preserve the structure of the dataset. This paper proposes a novel approach to missing data imputation in biomedical datasets using an ensemble of deeply learned clustering and L2 regularized regression based on symmetric uncertainty. The experiments are conducted with different proportions of missing data on both genomic and non-genomic biomedical datasets for different types of missingness patterns. The proposed approach is compared with seven proven baseline imputation methods and two recently proposed imputation approaches. The results show that the proposed approach outperforms the other approaches considered in the experiments in terms of imputation accuracy and computational efficiency while preserving the structure of the dataset. As a result, the overall classification accuracy of the biomedical classification tasks also improves when the proposed missing data imputation technique is used.
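
The abstract names two ingredients without spelling them out: symmetric uncertainty, commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and L2 regularized (ridge) regression. The sketch below is an illustration of how the two could be combined to fill one incomplete column; it is not the authors' implementation. It omits the deeply learned clustering stage entirely, and the helper names (`discretize`, `symmetric_uncertainty`, `ridge_impute`), the equal-width binning, and the parameters `k`, `alpha`, and `bins` are all assumptions made for this example. It also assumes that only the target column contains missing values.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete arrays x and y."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 0.0  # both variables are constant
    # Joint entropy from the counts of distinct (x, y) pairs.
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    hxy = float(-(p * np.log2(p)).sum())
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def discretize(col, bins=5):
    """Equal-width binning (an assumption; the paper may discretize differently)."""
    edges = np.histogram_bin_edges(col, bins=bins)
    return np.digitize(col, edges[1:-1])


def ridge_impute(X, target, k=2, alpha=1.0, bins=5):
    """Fill NaNs in column `target` using ridge regression on the k columns
    with the highest symmetric uncertainty with respect to that column.
    Assumes every other column is fully observed."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X[:, target])
    obs = ~miss
    others = [j for j in range(X.shape[1]) if j != target]

    # Rank candidate predictors by SU, computed on the observed rows only.
    t_disc = discretize(X[obs, target], bins)
    su = [symmetric_uncertainty(discretize(X[obs, j], bins), t_disc) for j in others]
    top = [others[i] for i in np.argsort(su)[::-1][:k]]

    # Closed-form ridge fit on centered data (the intercept is not penalized).
    A, b = X[obs][:, top], X[obs, target]
    a_mean, b_mean = A.mean(axis=0), b.mean()
    Ac = A - a_mean
    w = np.linalg.solve(Ac.T @ Ac + alpha * np.eye(len(top)), Ac.T @ (b - b_mean))

    X[miss, target] = b_mean + (X[miss][:, top] - a_mean) @ w
    return X
```

A call such as `ridge_impute(X, target=0, k=1, alpha=1e-3)` returns a copy of `X` with column 0 filled in. In a cluster-based scheme like the one the abstract describes, a regression of this kind would presumably be fitted within each cluster rather than on the whole dataset, but that detail is not stated in the abstract.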
Pages: 16