Missing data analyses: a hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering

Cited by: 50
Authors
Tian, Jing [1]
Yu, Bing [1]
Yu, Dan [1]
Ma, Shilong [1, 2]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Missing data; Multiple imputation; Gray System Theory; Entropy; Clustering; ESTIMATING NULL VALUES; FUZZY C-MEANS; INFORMATION;
DOI
10.1007/s10489-013-0469-x
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Researchers and practitioners who work with databases often find knowledge discovery and application development cumbersome because of missing data. Although some approaches tolerate a certain rate of incomplete data, a large portion of them require complete, high-quality data. A great number of strategies have therefore been designed to handle missingness, particularly by imputation. Single imputation methods initially succeeded in predicting missing values for specific types of distributions. Multiple imputation algorithms have nevertheless remained prevalent, because they improve validity further by iteratively minimizing bias and require less prior knowledge of the distributions. This article reviews the state of the art and proposes a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). First, the non-missing data instances are separated into several clusters. Then, for each incomplete instance, the imputed value is obtained after multiple calculations that use the information entropy of the proximal category, where proximity is measured by a similarity metric based on Gray System Theory (GST). Experimental results on University of California Irvine (UCI) datasets show that MIGEC is more accurate than other current methods for both numeric and categorical attributes under different missing mechanisms. Further discussion on real aerospace datasets indicates that MIGEC is also applicable to that specific area, generally giving more precise inference and faster convergence than other multiple imputation methods.
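As a rough illustration of the clustering and GST-similarity steps summarized above, the following Python sketch fills one incomplete instance from its most similar cluster. It is not the authors' MIGEC implementation: the abstract gives no formulas, so the sketch assumes Deng's grey relational grade with distinguishing coefficient 0.5 as the similarity metric, assumes the complete instances are already clustered and scaled to comparable ranges, and leaves out both the entropy weighting and the repeated (multiple) imputation rounds. The names `grey_relational_grades`, `centroid`, and `impute_row` are hypothetical.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing attribute


def grey_relational_grades(reference: Row,
                           candidates: Sequence[Sequence[float]],
                           rho: float = 0.5) -> List[float]:
    """Deng's grey relational grade of each candidate to the reference.

    Only attributes observed in the reference are compared.
    rho is the distinguishing coefficient (0.5 is a common default).
    """
    observed = [k for k, v in enumerate(reference) if v is not None]
    if not observed or not candidates:
        return [1.0] * len(candidates)
    deltas = [[abs(reference[k] - c[k]) for k in observed] for c in candidates]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0.0:  # every candidate matches the reference exactly
        return [1.0] * len(candidates)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Attribute-wise mean of a cluster of complete instances."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def impute_row(row: Row,
               clusters: Sequence[Sequence[Sequence[float]]],
               rho: float = 0.5) -> List[float]:
    """Fill missing attributes from the cluster most similar to the row.

    Assumption of this sketch: the fill value is the grade-weighted mean
    of the members of the nearest cluster.
    """
    grades = grey_relational_grades(row, [centroid(c) for c in clusters], rho)
    nearest = clusters[grades.index(max(grades))]
    weights = grey_relational_grades(row, nearest, rho)
    total = sum(weights)
    filled = list(row)
    for k, value in enumerate(row):
        if value is None:
            filled[k] = sum(w * m[k] for w, m in zip(weights, nearest)) / total
    return filled


# Two pre-clustered groups of complete instances (illustrative data only).
clusters = [
    [[1.0, 1.0, 1.0], [1.2, 0.9, 1.1]],
    [[5.0, 5.0, 5.0], [5.2, 4.9, 5.1]],
]
row = [1.1, None, 1.0]
filled = impute_row(row, clusters)
```

Here the incomplete instance is matched to the first cluster, and its missing second attribute is filled with a value between the two members' values, weighted toward the member with the higher grey relational grade.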
Pages: 376 - 388
Number of pages: 13
Related papers
50 in total
  • [41] Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses
    Faris, PD
    Ghali, WA
    Brant, R
    Norris, CM
    Galbraith, PD
    Knudtson, ML
    JOURNAL OF CLINICAL EPIDEMIOLOGY, 2002, 55 (02) : 184 - 191
  • [42] Sensitivity Analysis of Missing Data: Case Studies Using Model-Based Multiple Imputation
    Jie Zhang
    Drug information journal : DIJ / Drug Information Association, 2009, 43 (4) : 475 - 484
  • [43] A Hybrid Data Clustering Using Firefly Algorithm Based Improved Genetic Algorithm
    Maheshwar
    Kaushik, Keshav
    Arora, Vikram
    SECOND INTERNATIONAL SYMPOSIUM ON COMPUTER VISION AND THE INTERNET (VISIONNET'15), 2015, 58 : 249 - 256
  • [44] A Global Clustering Approach Using Hybrid Optimization for Incomplete Data Based on Interval Reconstruction of Missing Value
    Zhang, Liyong
    Lu, Wei
    Liu, Xiaodong
    Pedrycz, Witold
    Zhong, Chongquan
    Wang, Lu
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2016, 31 (04) : 297 - 313
  • [45] A comparison of multiple imputation with EM algorithm and MCMC method for quality of life missing data
    Ting Hsiang Lin
    Quality & Quantity, 2010, 44 : 277 - 287
  • [46] A clustering algorithm based on the weighted entropy of conditional attributes for mixed data
    Zhou, Jing
    Chen, Ke
    Liu, Jinsheng
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (17)
  • [47] Analysis of Longitudinal Clinical Trials with Missing Data Using Multiple Imputation in Conjunction with Robust Regression
    Mehrotra, Devan V.
    Li, Xiaoming
    Liu, Jiajun
    Lu, Kaifeng
    BIOMETRICS, 2012, 68 (04) : 1250 - 1259
  • [48] Selecting the number of imputed datasets when using multiple imputation for missing data and disclosure limitation
    Reiter, Jerome P.
    STATISTICS & PROBABILITY LETTERS, 2008, 78 (01) : 15 - 20
  • [49] Multiple imputation using auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random
    Curnow, Elinor
    Cornish, Rosie P.
    Heron, Jon E.
    Carpenter, James R.
    Tilling, Kate
    BMC MEDICAL RESEARCH METHODOLOGY, 2024, 24 (01)
  • [50] Multiple imputation method of missing credit risk assessment data based on generative adversarial networks
    Zhao, Feng
    Lu, Yan
    Li, Xinning
    Wang, Lina
    Song, Yingjie
    Fan, Deming
    Zhang, Caiming
    Chen, Xiaobo
    APPLIED SOFT COMPUTING, 2022, 126