An Information-Theoretic Approach for Setting the Optimal Number of Decision Trees in Random Forests

Cited by: 17
Authors
Cuzzocrea, Alfredo [1 ]
Francis, Shane Leo [2 ]
Gaber, Mohamed Medhat [2 ]
Affiliations
[1] CNR, ICAR, I-00185 Rome, Italy
[2] Univ Portsmouth, Sch Comp, Portsmouth, Hants, England
Source
2013 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC 2013) | 2013
Keywords
Random Forests; Data Mining; Data Classification; Predictive Power; Information Gain; Ensemble Classification;
DOI
10.1109/SMC.2013.177
CLC classification number
TP3 [Computing Technology; Computer Technology];
Discipline code
0812
Abstract
Data classification is a process within the data mining and machine learning fields that aims to annotate every instance of a dataset with a so-called class label. It involves building a model from a training set of already-labeled data instances; this model can then be used to assign class labels to instances that have not yet been classified. A successful way of performing the classification process is the Random Forests (RF) algorithm, which is a type of ensemble-based classifier. An ensemble-based classifier increases the accuracy of the class label assigned to a data instance by using a set of classifiers modeled on different, possibly overlapping, instance sets and then combining the intermediate classification results so obtained. In particular, RF uses a number of decision trees to classify an instance and takes the majority vote of these trees as the final classification. Choosing the number of trees is a critical task in RF that heavily affects the accuracy of the final classifier. In this paper, we propose a variation of RF that adjusts one of the two parameters RF takes, the number of decision trees, depending on a meaningful relation between the dataset's predictive power rating and the number of trees itself, with the goal of improving the accuracy and performance of the algorithm. We demonstrate this through a comprehensive experimental evaluation on several clean datasets.
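The abstract ties the number of trees to the dataset's "predictive power rating", which the keywords suggest is rooted in information gain. The sketch below is a minimal illustration of that idea, not the paper's method: it computes per-attribute information gain on a toy dataset, averages it into a crude power score (an assumption), and maps that score to a tree count through a hypothetical rule invented here for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in label entropy from splitting on attribute index `attr`."""
    base = entropy(labels)
    n = len(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(y)
    return base - sum(len(p) / n * entropy(p) for p in partitions.values())

# Toy dataset: two binary attributes, binary class label.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]  # the class label follows attribute 0 exactly

gains = [information_gain(rows, labels, a) for a in range(2)]
power = sum(gains) / len(gains)  # crude "predictive power rating" (assumption)

# Hypothetical mapping: datasets with weaker attributes get more trees.
# The paper derives its actual relation experimentally; this is a placeholder.
num_trees = max(10, int(100 * (1 - power)))
print(gains, num_trees)
```

Here attribute 0 determines the label perfectly (gain 1.0 bit) while attribute 1 is uninformative (gain 0.0), so the averaged power is 0.5 and the placeholder rule selects 50 trees.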
Pages: 1013-1019 (7 pages)