Missing data analyses: a hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering

Cited by: 50
Authors
Tian, Jing [1]
Yu, Bing [1]
Yu, Dan [1]
Ma, Shilong [1,2]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Missing data; Multiple imputation; Gray System Theory; Entropy; Clustering; ESTIMATING NULL VALUES; FUZZY C-MEANS; INFORMATION;
DOI
10.1007/s10489-013-0469-x
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Researchers and practitioners who use databases often find knowledge discovery and application development cumbersome because of missing data. Although some approaches can tolerate a certain rate of incomplete data, a large portion of them demand complete, high-quality data. A great number of strategies have therefore been designed to handle missingness, particularly through imputation. Single imputation methods initially succeeded in predicting missing values for specific types of distributions. Multiple imputation algorithms, however, have remained prevalent because they further improve validity by iteratively minimizing bias and require less prior knowledge of the distributions. This article reviews the state of the art and proposes a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). First, the non-missing data instances are separated into several clusters. Then, for each incomplete instance, the imputed value is obtained after multiple calculations that use the information entropy of the proximal category, where proximity is measured by a similarity metric based on Gray System Theory (GST). Experimental results on University of California Irvine (UCI) datasets illustrate the superiority of MIGEC over other current methods in accuracy for both numeric and categorical attributes under different missing mechanisms. Further discussion on real aerospace datasets indicates that MIGEC is also applicable to this specific area, generally with more precise inference and faster convergence than other multiple imputation methods.
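The cluster-then-impute idea in the abstract can be illustrated with a minimal sketch. This is not the MIGEC algorithm itself: the abstract does not give its formulas, so the code below makes several assumptions. It uses a plain k-means with farthest-point initialisation for the clustering step, Deng's grey relational grade (with the usual distinguishing coefficient rho = 0.5) as the similarity metric, and a single similarity-weighted average as the imputed value. The entropy weighting and the repeated imputation passes are left out, the data are assumed numeric, and every incomplete row is assumed to have at least one observed value. All function names (grey_relational_grade, kmeans, impute) are hypothetical.

```python
import numpy as np

def grey_relational_grade(reference, candidates, rho=0.5):
    """Deng's grey relational grade of each candidate row against one reference row."""
    delta = np.abs(candidates - reference)            # absolute differences, shape (n, d)
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:                                  # every candidate equals the reference
        return np.ones(len(candidates))
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)                         # average coefficient per candidate

def assign(X, centroids):
    """Index of the nearest centroid (Euclidean) for every row of X."""
    return np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def kmeans(X, k, n_iter=50):
    """Plain k-means; farthest-point initialisation keeps the sketch deterministic."""
    centroids = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dist)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign(X, centroids), centroids

def impute(data, k=2, rho=0.5):
    """Fill NaNs using the grey-relational-nearest cluster of complete rows."""
    data = np.asarray(data, dtype=float)
    complete = data[~np.isnan(data).any(axis=1)]
    # Min-max scale with statistics from the complete rows so attributes are comparable.
    lo, hi = complete.min(axis=0), complete.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (complete - lo) / span
    labels, centroids = kmeans(scaled, k)
    out = data.copy()
    for i in np.flatnonzero(np.isnan(data).any(axis=1)):
        obs = ~np.isnan(data[i])                      # observed attributes of this row
        row = (data[i, obs] - lo[obs]) / span[obs]
        nearest = np.argmax(grey_relational_grade(row, centroids[:, obs], rho))
        in_cluster = labels == nearest
        weights = grey_relational_grade(row, scaled[in_cluster][:, obs], rho)
        # Similarity-weighted average of the cluster members' values (assumed estimator).
        out[i, ~obs] = weights @ complete[in_cluster][:, ~obs] / weights.sum()
    return out

if __name__ == "__main__":
    nan = np.nan
    toy = [[1.0, 1.0, 1.0], [1.2, 0.9, 1.1], [0.9, 1.1, 1.0],
           [10.0, 10.0, 10.0], [10.2, 9.9, 10.1], [9.8, 10.1, 9.9],
           [1.1, 1.0, nan], [10.0, nan, 10.0]]
    print(impute(toy, k=2))
```

Because each imputed value is a weighted average over one cluster, it always lies within the range that cluster's members take on the missing attribute. The method described in the abstract additionally relies on information entropy and multiple calculations, which this sketch does not attempt to reproduce.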
Pages: 376 - 388
Page count: 13
Related papers
50 in total
  • [31] Handling Missing Data in Cross-Classified Multilevel Analyses: An Evaluation of Different Multiple Imputation Approaches
    Grund, Simon
    Luedtke, Oliver
    Robitzsch, Alexander
    JOURNAL OF EDUCATIONAL AND BEHAVIORAL STATISTICS, 2023, 48 (04) : 454 - 489
  • [32] A novel clustering-based purity and distance imputation for handling medical data with missing values
    Cheng, Ching-Hsue
    Huang, Shu-Fen
    SOFT COMPUTING, 2021, 25 (17) : 11781 - 11801
  • [33] Imputation Method Based on Collaborative Filtering and Clustering for the Missing Data of the Squeeze Casting Process Parameters
    Deng, Jianxin
    Ye, Zhixing
    Shan, Lubao
    You, Dongdong
    Liu, Guangming
    INTEGRATING MATERIALS AND MANUFACTURING INNOVATION, 2022, 11 (01) : 95 - 108
  • [34] Wind power prediction with missing data using Gaussian process regression and multiple imputation
    Liu, Tianhong
    Wei, Haikun
    Zhang, Kanjian
    APPLIED SOFT COMPUTING, 2018, 71 : 905 - 916
  • [35] Using Multiple Imputation to Account for the Uncertainty Due to Missing Data in the Context of Factor Retention
    Xia, Yan
    Havan, Selim
    EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 2024, 84 (03) : 577 - 593
  • [36] An entropy-based subspace clustering algorithm for categorical data
    Carbonera, Joel Luis
    Abel, Mara
    2014 IEEE 26TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2014 : 272 - 277
  • [37] Missing Data Imputation using Machine Learning Algorithm for Supervised Learning
    Cenitta, D.
    Arjunan, R. Vijaya
    Prema, K., V
    2021 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2021,
  • [38] Missing Values Imputation Using Genetic Algorithm for the Analysis of Traffic Data
    Midde, Ranjit Reddy
    Srinivasa, K. G.
    Reddy, Eswara B.
    ARTIFICIAL INTELLIGENCE AND EVOLUTIONARY COMPUTATIONS IN ENGINEERING SYSTEMS, ICAIECES 2017, 2018, 668 : 251 - 261
  • [39] Using multiple imputation to estimate cumulative distribution functions in longitudinal data analysis with data missing at random
    Dinh, Phillip
    PHARMACEUTICAL STATISTICS, 2013, 12 (05) : 260 - 267
  • [40] Sensitivity Analysis of Missing Data: Case Studies Using Model-Based Multiple Imputation
    Zhang, Jie
    DRUG INFORMATION JOURNAL, 2009, 43 (04) : 475 - 484