Adjusted weight voting algorithm for random forests in handling missing values

Cited by: 69
Authors
Xia, Jing [1 ]
Zhang, Shengyu [1 ]
Cai, Guolong [2 ]
Li, Li [2 ]
Pan, Qing [3 ]
Yan, Jing [2 ]
Ning, Gangmin [1 ]
Affiliations
[1] Zhejiang Univ, Dept Biomed Engn, Key Lab Biomed Engn, Minist Educ, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Hosp, Dept ICU, 12 Lingyin Rd, Hangzhou 310013, Zhejiang, Peoples R China
[3] Zhejiang Univ Technol, Coll Informat Engn, 288 Liuhe Rd, Hangzhou 310023, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Random forests; Missing values; Imputation approaches; Surrogate decisions; Weighted voting; CLASSIFICATION; IMPUTATION;
DOI
10.1016/j.patcog.2017.04.005
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Random forests (RF) is known as an efficient classification algorithm, but it depends on the completeness of the dataset. Conventional methods for handling missing values usually rely on estimation and imputation approaches, whose efficiency is tied to assumptions about the data. Recently, a surrogate-decision algorithm for RF was developed, and this paper proposes a random forests algorithm with modified surrogate splits (Adjusted Weight Voting Random Forest, AWVRF) that can address incomplete data without imputation. Unlike the existing surrogate method, in AWVRF, when the primary splitting attribute and all surrogate attributes of an internal node are missing, the instance is allowed to exit at the current node with a vote. The weight of that vote is then adjusted according to the strength of the involved attributes, and the final decision is made by weighted voting. Because AWVRF contains no imputation step, it is independent of the data features. AWVRF is compared with mean imputation, LeoFill, knnimpute, BPCAfill, and conventional RF with surrogate decisions (surrRF) using 50 repetitions of 5-fold cross-validation on 10 widely used datasets. Across 22 experimental settings, AWVRF achieves the highest accuracy in 14 settings and the largest AUC in 7 settings, exhibiting its superiority over the other methods. Compared with surrRF, AWVRF is significantly more efficient while retaining good discriminative performance. The experimental results show that the proposed AWVRF algorithm can successfully handle classification tasks on incomplete data. (C) 2017 Elsevier Ltd. All rights reserved.
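The exit-and-vote mechanism described in the abstract can be made concrete with a short sketch. The Python snippet below is an illustrative reading, not the authors' implementation: the node layout, the per-attribute strength scores, and the specific weight formula (mean strength of the attributes actually evaluated before the early exit) are assumptions introduced here only to show the control flow of adjusted-weight voting.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    # Leaf nodes carry a class label; internal nodes carry a primary split,
    # a list of surrogate splits, and the majority class of their training samples.
    label: Optional[int] = None
    attr: Optional[int] = None                       # primary split attribute index
    threshold: float = 0.0
    surrogates: list = field(default_factory=list)   # [(attr_index, threshold), ...]
    majority: int = 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_vote(node, x, strengths, used=None):
    """Return (predicted_class, vote_weight) for one tree and one instance.

    x maps attribute index -> value; missing attributes are simply absent.
    strengths maps attribute index -> an importance score (assumed to be given).
    """
    used = [] if used is None else used
    if node.label is not None:                       # reached a leaf: full-weight vote
        return node.label, 1.0
    # Try the primary split attribute first, then each surrogate in turn.
    for attr, thr in [(node.attr, node.threshold)] + node.surrogates:
        if attr in x:
            used.append(attr)
            child = node.left if x[attr] <= thr else node.right
            return tree_vote(child, x, strengths, used)
    # Primary and all surrogate attributes are missing: the instance exits here
    # and votes for the node's majority class with an adjusted (reduced) weight.
    if not used:
        return node.majority, 0.0                    # no attribute could be evaluated
    return node.majority, sum(strengths.get(a, 0.0) for a in used) / len(used)

def forest_predict(trees, x, strengths):
    """Weighted voting over all trees: the class with the largest summed weight wins."""
    tally = {}
    for root in trees:
        cls, w = tree_vote(root, x, strengths)
        tally[cls] = tally.get(cls, 0.0) + w
    return max(tally, key=tally.get)
```

A call such as `forest_predict(trees, {0: 1.2, 3: -0.5}, strengths)` would then combine full-weight votes from instances that reach a leaf with down-weighted votes from instances that exit early, which is the behaviour the abstract attributes to AWVRF.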
Pages: 52-60
Number of pages: 9