Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

Cited by: 97
Authors
Pan, Ruilin [1 ]
Yang, Tingsheng [1 ]
Cao, Jianhua [1 ]
Lu, Ke [1 ]
Zhang, Zhanchao [1 ]
Affiliations
[1] Anhui Univ Technol, Sch Management Sci & Engn, Maanshan 243032, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Missing data; Grey theory; Mutual information; Feature relevance; K nearest neighbours; FEATURE-SELECTION; ALGORITHM;
DOI
10.1007/s10489-015-0666-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Treatment of missing data has become increasingly significant in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) has been widely used to address this problem. However, previous studies have paid little attention to feature relevance, which has a significant impact on the selection of nearest neighbours. As a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, named the feature weighted grey KNN (FWGKNN) imputation algorithm. This approach employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms with various missing rates. Experimental results show that feature relevance has a non-negligible influence on missing data estimation based on grey theory, and that our method is superior to the four other estimation strategies compared. Moreover, the classification bias can be significantly reduced by using our approach in classification tasks.
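The abstract does not spell out how FWGKNN combines the MI weights with the grey relational measure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the standard grey relational coefficient with distinguishing coefficient rho = 0.5, feature weights taken from a plug-in MI estimate between each discretized feature and a class label, neighbours drawn from complete rows only, and mean imputation over the k most similar rows. The function names `discrete_mutual_information` and `fw_grey_knn_impute` are hypothetical.

```python
import numpy as np


def discrete_mutual_information(x, y):
    """Plug-in MI estimate (in nats) between two discrete 1-D arrays."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)      # contingency counts
    joint /= joint.sum()                     # joint probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal of x
    py = joint.sum(axis=0, keepdims=True)    # marginal of y
    nz = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())


def grey_relational_coefficients(target, candidates, rho=0.5):
    """Per-feature grey relational coefficients of each candidate row to target."""
    diff = np.abs(candidates - target)
    d_min, d_max = diff.min(), diff.max()
    if d_max == 0:                           # all candidates identical to target
        return np.ones_like(diff)
    return (d_min + rho * d_max) / (diff + rho * d_max)


def fw_grey_knn_impute(X, weights, k=3, rho=0.5):
    """Fill NaNs using the k complete rows with the highest weighted grey grade.

    Assumption: every incomplete row has at least one observed feature
    with a positive weight, and at least one complete row exists.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        coeff = grey_relational_coefficients(X[i, obs], complete[:, obs], rho)
        w = weights[obs] / weights[obs].sum()    # renormalise over observed features
        grade = coeff @ w                        # weighted grey relational grade
        nearest = np.argsort(grade)[::-1][:k]    # highest grade first
        filled[i, ~obs] = complete[nearest][:, ~obs].mean(axis=0)
    return filled
```

In this sketch the weights are passed in separately, so any relevance measure can be substituted; scaling the features to a common range before computing the coefficients would usually be advisable, since the grey relational coefficient works on raw absolute differences.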
Pages: 614-632
Number of pages: 19