A genetic algorithm approach to optimising random forests applied to class engineered data

Cited by: 55
Authors
Elyan, Eyad [1]
Gaber, Mohamed Medhat [1]
Affiliation
[1] Robert Gordon Univ, Sch Comp Sci & Digital Media, Garthdee Rd, Aberdeen AB10 7GJ, Scotland
Keywords
Random forests; Genetic algorithm; Class decomposition; Life science; CLASSIFICATION; CLASSIFIERS; PREDICTION; PACKAGE;
DOI
10.1016/j.ins.2016.08.007
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In numerous applications and especially in the life science domain, examples are labelled at a higher level of granularity. For example, binary classification is dominant in many of these data sets, with the positive class denoting the existence of a particular disease in medical diagnosis applications. Such labelling does not depict the reality of having different categories of the same disease; a fact evidenced in the continuous research in root causes and variations of symptoms in a number of diseases. In a quest to enhance such diagnosis, data sets were decomposed using clustering of each class to reveal hidden categories. We then apply the widely adopted ensemble classification technique Random Forests. Such class decomposition has two advantages: (1) diversification of the input that enhances the ensemble classification; and (2) improving class separability, easing the follow-up classification process. However, to be able to apply Random Forests on such class decomposed data, three main parameters need to be set: number of trees forming the ensemble, number of features to split on at each node, and a vector representing the number of clusters in each class. The large search space for tuning these parameters has motivated the use of Genetic Algorithm to optimise the solution. A thorough experimental study on 22 real data sets was conducted, predominantly in a variety of life science applications. To prove the applicability of the method to other areas of application, the proposed method was tested on a number of data sets from other domains. Three variations of Random Forests including the proposed method as well as a boosting ensemble classifier were used in the experimental study. The results prove the superiority of the proposed method in boosting up the accuracy. Crown Copyright (C) 2016 Published by Elsevier Inc. All rights reserved.
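The class decomposition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, so a small k-means is assumed here, and the names `kmeans`, `decompose`, `to_original`, and `clusters_per_class` are hypothetical. Each class is clustered separately, every example is relabelled with a (class, cluster) sub-label, and a sub-label is mapped back to its parent class after prediction.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means for illustration; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def decompose(X, y, clusters_per_class):
    """Relabel each example as (original_class, cluster_index)."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    sub_labels = [None] * len(y)
    for label, idx in by_class.items():
        # Never ask for more clusters than the class has examples.
        k = min(clusters_per_class.get(label, 1), len(idx))
        assign = kmeans([X[i] for i in idx], k)
        for i, a in zip(idx, assign):
            sub_labels[i] = (label, a)
    return sub_labels


def to_original(sub_label):
    """Map a predicted sub-class back to its parent class."""
    return sub_label[0]
```

In this sketch, `clusters_per_class` plays the role of the vector of cluster counts per class mentioned in the abstract. Together with the number of trees and the number of features tried at each split, it would form the candidate solution that a genetic algorithm searches over; a classifier would be trained on the sub-labels and its predictions passed through `to_original`.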
Pages: 220-234
Number of pages: 15