A MISSING DATA IMPUTATION METHOD WITH DISTANCE FUNCTION

Cited by: 0
Authors
Jea, Kuen-Fang [1]
Hsu, Chin-Wei [1]
Tang, Li-You [1]
Affiliation
[1] Natl Chung Hsing Univ, Dept Comp Sci & Engn, Taichung 40227, Taiwan
Source
PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (ICMLC), VOL 2 | 2018
Keywords
Big data; Missing data; Association rules; Distance function; Imputation
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
"Missing data" is an important research issue in big data analysis, because missing values can make data hard to analyze precisely. In recent research, several imputation-based methods have been proposed to solve the missing data issue without using domain knowledge. Among them, a missing data imputation method based on association rule mining was proposed to determine which value should be filled in for each missing entry. However, the generated rules may not always be suitable for filling in missing values. For example, some strong rules may fill different missing values with the same result. We propose an algorithm named RID (Rule-based Imputation with Distance function) to deal with this shortcoming. RID generates rules for missing data imputation by association rule mining and then uses a distance function to adjust the rules so that values are filled in appropriately. Experimental results show that the accuracy of RID is approximately 3 to 5 percent higher than that of C4.5 and kNN, and approximately 6 to 7 percent higher than that of HMiT.
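The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it describes: mine association rules, then use a distance function to choose among the values those rules suggest. It is not the RID algorithm itself. The single-antecedent rule format, the overlap distance, the thresholds, and all function names and sample data are assumptions made for illustration.

```python
from collections import Counter


def mine_rules(records, target, min_support=2, min_conf=0.5):
    """Mine single-antecedent rules (attr=value) -> (target=value).

    Assumption: rules are limited to one antecedent to keep the sketch short.
    """
    complete = [r for r in records if r[target] is not None]
    antecedent_counts = Counter()
    pair_counts = Counter()
    for r in complete:
        for attr, val in r.items():
            if attr == target or val is None:
                continue
            antecedent_counts[(attr, val)] += 1
            pair_counts[(attr, val, r[target])] += 1
    rules = []
    for (attr, val, tval), n in pair_counts.items():
        conf = n / antecedent_counts[(attr, val)]
        if n >= min_support and conf >= min_conf:
            rules.append((attr, val, tval, conf))
    return rules


def overlap_distance(a, b, skip):
    """Fraction of attributes known in both records whose values differ."""
    shared = [k for k in a
              if k != skip and a[k] is not None and b.get(k) is not None]
    if not shared:
        return 1.0
    return sum(a[k] != b[k] for k in shared) / len(shared)


def impute(record, records, rules, target):
    """Pick, among values proposed by matching rules, the value carried by
    the nearest record that has the target attribute filled in."""
    candidates = {tval for attr, val, tval, _ in rules
                  if record.get(attr) == val}
    best = None
    for r in records:
        if r[target] in candidates:
            d = overlap_distance(record, r, target)
            if best is None or d < best[0]:
                best = (d, r[target])
    return best[1] if best else None


# Hypothetical sample data; None marks a missing value.
data = [
    {"outlook": "sunny", "wind": "weak", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "wind": "weak", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "humidity": "normal", "play": "yes"},
    {"outlook": "sunny", "wind": "strong", "humidity": "normal", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "humidity": "high", "play": "no"},
]
query_a = {"outlook": "sunny", "wind": None, "humidity": "normal", "play": None}
query_b = {"outlook": "sunny", "wind": None, "humidity": "high", "play": None}

rules = mine_rules(data, "play")
filled_a = impute(query_a, data, rules, "play")
filled_b = impute(query_b, data, rules, "play")
```

Both query records match the rules with antecedent `outlook=sunny`, yet the distance step lets them receive different values, which is the kind of adjustment the abstract motivates.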
Pages: 450-455
Page count: 6