Missing data analyses: a hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering

Times cited: 50
Authors
Tian, Jing [1]
Yu, Bing [1]
Yu, Dan [1]
Ma, Shilong [1, 2]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Missing data; Multiple imputation; Gray System Theory; Entropy; Clustering; ESTIMATING NULL VALUES; FUZZY C-MEANS; INFORMATION;
DOI
10.1007/s10489-013-0469-x
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Researchers and practitioners who use databases often find knowledge discovery and application development cumbersome because of missing data. Although some approaches tolerate a certain rate of incomplete data, most of them require complete, high-quality data. Many strategies have therefore been designed to handle missingness, particularly through imputation. Single imputation methods initially succeeded in predicting missing values for specific types of distributions. Multiple imputation algorithms have nevertheless remained prevalent, because they further improve validity by iteratively minimizing bias and they require less prior knowledge of the distributions. This article reviews the state of the art and proposes a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). First, the non-missing data instances are separated into several clusters. Then, for each incomplete instance, the imputed value is obtained after multiple calculations that use the information entropy of the proximal category, where proximity is measured by a similarity metric based on Gray System Theory (GST). Experimental results on University of California Irvine (UCI) datasets show that MIGEC is more accurate than other current methods for both numeric and categorical attributes under different missing mechanisms. Further discussion of real aerospace datasets indicates that MIGEC is also applicable to that specific area, with generally more precise inference and faster convergence than other multiple imputation methods.
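The cluster-then-impute idea in the abstract can be illustrated with a minimal sketch. It is not the authors' MIGEC implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. The sketch clusters the complete rows with plain k-means (not fuzzy c-means), measures similarity with Deng's grey relational grade using the conventional distinguishing coefficient `rho = 0.5`, and fills each missing numeric value with a grade-weighted mean over the most similar cluster. The entropy weighting and the repeated (multiple) imputation rounds are omitted, and the function names `grey_relational_grades`, `kmeans`, and `impute` are hypothetical.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Deng's grey relational grade of each candidate sequence w.r.t. reference."""
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate equals the reference
        return [1.0] * len(candidates)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda i: sq_dist(r, centroids[i])) for r in rows]
        for i in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == i]
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def impute(rows, k=2, rho=0.5):
    """Fill None entries; assumes numeric data and >= 1 observed value per row."""
    complete = [r for r in rows if None not in r]
    centroids, labels = kmeans(complete, k)
    # Pair each centroid with its members, skipping empty clusters.
    clusters = []
    for i, centroid in enumerate(centroids):
        members = [r for r, lab in zip(complete, labels) if lab == i]
        if members:
            clusters.append((centroid, members))
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        ref = [row[j] for j in observed]
        # Step 1: pick the cluster whose centroid is most similar to the row.
        grades = grey_relational_grades(
            ref, [[c[j] for j in observed] for c, _ in clusters], rho
        )
        _, members = clusters[grades.index(max(grades))]
        # Step 2: weight that cluster's members by their similarity to the row.
        weights = grey_relational_grades(
            ref, [[m[j] for j in observed] for m in members], rho
        )
        total = sum(weights)
        filled = list(row)
        for j, v in enumerate(row):
            if v is None:
                filled[j] = sum(w * m[j] for w, m in zip(weights, members)) / total
        result.append(filled)
    return result
```

Calling `impute(rows, k=2)` on a list of equal-length numeric rows, with `None` marking missing entries, returns a new list with the gaps filled. Because the grey relational coefficients depend on raw differences, attributes on very different scales would likely need normalising first, a step this sketch leaves out.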
Pages: 376 - 388
Number of pages: 13
Related papers
50 in total
  • [21] Imputation of missing data based on locally weighted algorithm
    College of Information Engineering, Shenyang University of Chemical Technology, Shenyang, China
    J. Comput. Inf. Syst., 4 (1195-1204) : 1195 - 1204
  • [22] Multiple imputation for missing data in a longitudinal cohort study: a tutorial based on a detailed case study involving imputation of missing outcome data
    Lee, Katherine J.
    Roberts, Gehan
    Doyle, Lex W.
    Anderson, Peter J.
    Carlin, John B.
    INTERNATIONAL JOURNAL OF SOCIAL RESEARCH METHODOLOGY, 2016, 19 (05) : 575 - 591
  • [23] Hybrid imputation-based optimal evidential classification for missing data
    Zhang, Zhen
    Tian, Hong-peng
    APPLIED INTELLIGENCE, 2025, 55 (01)
  • [24] Latent class based multiple imputation approach for missing categorical data
    Gebregziabher, Mulugeta
    DeSantis, Stacia M.
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2010, 140 (11) : 3252 - 3262
  • [25] Using multiple imputation to estimate missing data in meta-regression
    Ellington, E. Hance
    Bastille-Rousseau, Guillaume
    Austin, Cayla
    Landolt, Kristen N.
    Pond, Bruce A.
    Rees, Erin E.
    Robar, Nicholas
    Murray, Dennis L.
    METHODS IN ECOLOGY AND EVOLUTION, 2015, 6 (02) : 153 - 163
  • [26] Missing value estimation using clustering and deep learning within multiple imputation framework
    Samad, Manar D.
    Abrar, Sakib
    Diawara, Norou
    KNOWLEDGE-BASED SYSTEMS, 2022, 249
  • [27] Data stream clustering based on Fuzzy C-Mean algorithm and entropy theory
    Zhang, Baoju
    Qin, Shan
    Wang, Wei
    Wang, Dan
    Xue, Lei
    SIGNAL PROCESSING, 2016, 126 : 111 - 116
  • [28] Reference-based multiple imputation for missing data sensitivity analyses in trial-based cost-effectiveness analysis
    Leurent, Baptiste
    Gomes, Manuel
    Cro, Suzie
    Wiles, Nicola
    Carpenter, James R.
    HEALTH ECONOMICS, 2020, 29 (02) : 171 - 184
  • [29] Multiple imputation of missing marijuana data in the Fatality Analysis Reporting System using a Bayesian multilevel model
    Chen, Qixuan
    Williams, Sharifa Z.
    Liu, Yutao
    Chihuri, Stanford T.
    Li, Guohua
    ACCIDENT ANALYSIS AND PREVENTION, 2018, 120 : 262 - 269
  • [30] An Unsupervised Data-Mining and Generative-Based Multiple Missing Data Imputation Network for Energy Dataset
    Kim, Hyung Joon
    Kim, Mun Kyeom
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (11) : 13429 - 13440