Selective sampling for trees and forests

Cited by: 5
Authors
Badarna, Murad [1]
Shimshoni, Ilan [1]
Affiliations
[1] Univ Haifa, Fac Social Sci, Dept Informat Syst, Haifa, Israel
Keywords
Selective sampling; Decision trees; Random forests; Classification; Active learning; SUPPORT VECTOR MACHINE; DECISION TREES; CLASSIFICATION;
DOI
10.1016/j.neucom.2019.04.071
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this paper we describe selective sampling algorithms for decision trees and random forests and their contribution to classification accuracy. In our selective sampling algorithms, the instance that yields the highest expected utility is chosen to be labeled by the expert. We show that the most valuable unlabeled instance to be labeled by the expert and added to the decision tree's training dataset can be found simply by computing the influence of this new instance on the class probabilities of the leaves. All unlabeled instances that fall into the same leaf have the same class probabilities. As a result, we can compute the expected accuracy of the decision tree over its leaves instead of over each individual unlabeled instance. An extension for random forests is also presented. Moreover, we show that the selective sampling classifier must belong to the same family as the classifier whose accuracy we wish to improve, but need not be identical to it. For example, a random forest classifier can be used for the selective sampling process, and the results can be used to improve the classification accuracy of a decision tree. Likewise, a random forest classifier consisting of three trees can be used in the selective sampling algorithm to improve the classification accuracy of a random forest consisting of ten trees. Our experiments show that the proposed selective sampling algorithms achieve better accuracy than standard random sampling, uncertainty sampling, and the active belief decision tree learning approach (ABC4.5) on several real-world datasets. We also show that our selective sampling algorithms significantly improve the classification performance of several state-of-the-art classifiers, such as the random rotation forest classifier, on real-world large-scale datasets. (C) 2019 Published by Elsevier B.V.
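The key observation in the abstract, that all unlabeled instances falling into the same leaf share the same class probabilities, means a sampling score can be computed once per leaf rather than once per instance. The sketch below illustrates this idea with a simplified leaf-level uncertainty score (entropy of the leaf's class distribution) in place of the paper's full expected-utility computation; the function name and scoring choice are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def select_instance_per_leaf(tree, X_unlabeled):
    """Return the index of one unlabeled instance from the leaf with the
    highest class-probability entropy.  Because every instance in a leaf
    shares that leaf's class probabilities, the score is computed once
    per leaf instead of once per unlabeled instance."""
    leaf_ids = tree.apply(X_unlabeled)        # leaf index for each instance
    proba = tree.predict_proba(X_unlabeled)   # identical rows within a leaf
    scores = {}
    for leaf in np.unique(leaf_ids):
        p = proba[leaf_ids == leaf][0]        # any row of that leaf will do
        p = p[p > 0]
        scores[leaf] = -(p * np.log(p)).sum() # entropy of leaf distribution
    best_leaf = max(scores, key=scores.get)
    # Any instance in the best leaf is equivalent; pick the first one.
    return int(np.flatnonzero(leaf_ids == best_leaf)[0])

# Toy setup: 50 labeled instances train the tree, 150 remain unlabeled.
X, y = make_classification(n_samples=200, random_state=0)
labeled, unlabeled = X[:50], X[50:]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(labeled, y[:50])
idx = select_instance_per_leaf(tree, unlabeled)  # instance to send to the expert
```

In the paper's setting this per-leaf shortcut is what makes the expected-accuracy computation tractable, since the number of leaves is typically far smaller than the number of unlabeled instances.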
Pages: 93-108
Page count: 16