New Splitting Criteria for Decision Trees in Stationary Data Streams

被引:73
作者
Jaworski, Maciej [1 ]
Duda, Piotr [1 ]
Rutkowski, Leszek [1 ,2 ]
机构
[1] Czestochowa Tech Univ, Inst Computat Intelligence, PL-42200 Czestochowa, Poland
[2] Univ Social Sci, Inst Informat Technol, PL-90113 Lodz, Poland
关键词
Classification; data stream; decision trees; impurity measure; splitting criterion; MINING DATA STREAMS; ALGORITHM;
D O I
10.1109/TNNLS.2017.2698204
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The most popular tools for stream data mining are based on decision trees. In previous 15 years, all designed methods, headed by the very fast decision tree algorithm, relayed on Hoeffding's inequality and hundreds of researchers followed this scheme. Recently, we have demonstrated that although the Hoeffding decision trees are an effective tool for dealing with stream data, they are a purely heuristic procedure; for example, classical decision trees such as ID3 or CART cannot be adopted to data stream mining using Hoeffding's inequality. Therefore, there is an urgent need to develop new algorithms, which are both mathematically justified and characterized by good performance. In this paper, we address this problem by developing a family of new splitting criteria for classification in stationary data streams and investigating their probabilistic properties. The new criteria, derived using appropriate statistical tools, are based on the misclassification error and the Gini index impurity measures. The general division of splitting criteria into two types is proposed. Attributes chosen based on type-I splitting criteria guarantee, with high probability, the highest expected value of split measure. Type-I I criteria ensure that the chosen attribute is the same, with high probability, as it would be chosen based on the whole infinite data stream. Moreover, in this paper, two hybrid splitting criteria are proposed, which are the combinations of single criteria based on the misclassification error and Gini index.
引用
收藏
页码:2516 / 2529
页数:14
相关论文
共 34 条
  • [1] [Anonymous], 2007, Data Streams: Models and Algorithms
  • [2] [Anonymous], 2013, TECH REP
  • [3] [Anonymous], TECH REP
  • [4] [Anonymous], 2014, C4. 5: programs for machine learning
  • [5] [Anonymous], 2004, COMPUTER SCI
  • [6] [Anonymous], METRIKA
  • [7] [Anonymous], 2015 INT JOINT C NEU
  • [8] [Anonymous], 2009, Neural Network Learning: Theoretical Foundations
  • [9] Bifet A, 2010, FRONT ARTIF INTEL AP, V207, P1, DOI 10.3233/978-1-60750-472-6-i
  • [10] Dealing With Concept Drifts in Process Mining
    Bose, R. P. Jagadeesh Chandra
    van der Aalst, Wil M. P.
    Zliobaite, Indre
    Pechenizkiy, Mykola
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2014, 25 (01) : 154 - 171