Good methods for coping with missing data in decision trees

Cited by: 92
Authors
Twala, B. E. T. H. [2]
Jones, M. C. [1]
Hand, D. J. [3]
Affiliations
[1] Open Univ, Dept Math & Stat, Milton Keynes MK7 6AA, Bucks, England
[2] Stat South Africa, Methodol & Stand Div, ZA-0001 Pretoria, South Africa
[3] Univ London Imperial Coll Sci Technol & Med, Dept Math, London SW7 2AZ, England
Keywords
C4.5; CART; EM algorithm; fractional cases; missingness as attribute; multiple imputation
DOI
10.1016/j.patrec.2008.01.010
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a simple and effective method for dealing with missing data in decision trees used for classification. We call this approach "missingness incorporated in attributes" (MIA). It is very closely related to the technique of treating "missing" as a category in its own right, generalizing it for use with continuous as well as categorical variables. We show through a substantial data-based study of classification accuracy that MIA exhibits consistently good performance across a broad range of data types and of sources and amounts of missingness. It is competitive with the best of the rest (particularly, a multiple imputation EM algorithm method; EMMI) while being conceptually and computationally simpler. A simple combination of MIA and EMMI is slower but even more accurate. (C) 2008 Elsevier B.V. All rights reserved.
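The abstract does not spell out how "missing" is folded into a split on a continuous variable, so the following is only a minimal sketch of one plausible reading. It assumes that, for each threshold, the missing cases may be sent to either branch, and that a further candidate split separates missing from observed values. The Gini impurity criterion and the names `gini`, `weighted_impurity`, and `best_mia_split` are illustrative assumptions, not taken from the record.

```python
def gini(labels):
    """Gini impurity of a list of class labels (assumed criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_mia_split(values, labels):
    """Pick the lowest-impurity split for one continuous attribute.

    `values` holds floats, with None marking a missing value.
    Returns (impurity, kind, threshold), or None if no split exists.
    """
    observed = sorted({v for v in values if v is not None})
    # Candidate thresholds: midpoints between consecutive observed values.
    thresholds = [(a + b) / 2 for a, b in zip(observed, observed[1:])]
    candidates = []
    for t in thresholds:
        # Assumption: try sending the missing cases left, then right.
        for missing_left in (True, False):
            left, right = [], []
            for v, y in zip(values, labels):
                go_left = missing_left if v is None else v <= t
                (left if go_left else right).append(y)
            kind = "missing_left" if missing_left else "missing_right"
            candidates.append((weighted_impurity(left, right), kind, t))
    # Assumption: also try "missing" versus "observed" as a split.
    miss = [y for v, y in zip(values, labels) if v is None]
    obs = [y for v, y in zip(values, labels) if v is not None]
    if miss and obs:
        candidates.append(
            (weighted_impurity(miss, obs), "missing_vs_observed", None)
        )
    return min(candidates, key=lambda c: c[0]) if candidates else None


if __name__ == "__main__":
    # Toy data in which missingness itself predicts the class.
    x = [1.0, 2.0, None, None, 3.0, 4.0]
    y = ["a", "a", "b", "b", "a", "a"]
    print(best_mia_split(x, y))
```

Under these assumptions no rows are dropped and no values are imputed: the missing cases simply take part in the split search, which is one way to see why such an approach could be computationally simpler than an imputation-based method.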
Pages: 950-956
Page count: 7