Instance Reduction for Avoiding Overfitting in Decision Trees

Cited by: 32
Authors
Amro, Asma' [1 ]
Al-Akhras, Mousa [2 ]
El Hindi, Khalil [3 ]
Habib, Mohamed [2 ,4 ]
Abu Shawar, Bayan [5 ]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Comp Informat Syst Dept, Amman 11942, Jordan
[2] Saudi Elect Univ, Coll Comp & Informat, Comp Sci Dept, Riyadh 11673, Saudi Arabia
[3] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11543, Saudi Arabia
[4] Port Said Univ, Fac Engn, Port Said 42526, Egypt
[5] Al Ain Univ, Coll Engn, Cybersecur Dept, Al Ain, U Arab Emirates
Keywords
Decision Trees; Overfitting; Pruning; Instance Reduction; Noise Filtering
DOI
10.1515/jisys-2020-0061
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Decision tree learning is one of the most practical classification methods in machine learning and is widely used for approximating discrete-valued target functions. However, decision trees may overfit the training data, which limits their ability to generalize to unseen instances. In this study, we investigate the use of instance reduction techniques to smooth the decision boundaries before training decision trees. Noise filters such as ENN, RENN, and ALLKNN remove noisy instances, while DROP3 and DROP5 may also remove genuine (non-noisy) instances. Extensive empirical experiments were conducted on 13 benchmark datasets from the UCI Machine Learning Repository, both with and without intentionally introduced noise. The results show that eliminating border instances improves the classification accuracy of decision trees and reduces the tree size, which in turn reduces training and classification times. On datasets without intentionally added noise, applying noise filters without the built-in Reduced Error Pruning gave the best classification accuracy: ENN, RENN, and ALLKNN outperformed decision tree learning without pruning on 9, 9, and 8 of the 13 datasets, respectively. When noise was intentionally introduced at different ratios, the datasets reduced using ENN and RENN without built-in pruning were the most effective.
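To make the described pipeline concrete, the following is a minimal sketch of the approach: filter noisy training instances with an ENN-style editor, then fit an unpruned decision tree on the reduced set. It is not the paper's experimental setup; it assumes scikit-learn and imbalanced-learn are installed, uses EditedNearestNeighbours as a stand-in for the ENN filter, and the Iris dataset and all parameter values are illustrative choices only.

```python
# Sketch: ENN-style instance reduction before training an unpruned decision tree.
# Assumes scikit-learn and imbalanced-learn; dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import EditedNearestNeighbours

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ENN: drop training instances whose class label disagrees with the majority
# of their k nearest neighbours, smoothing the decision boundary.
enn = EditedNearestNeighbours(sampling_strategy="all", n_neighbors=3)
X_red, y_red = enn.fit_resample(X_train, y_train)

# Fit on the reduced set without pruning (defaults grow the tree fully),
# mirroring the "noise filter instead of built-in pruning" setting.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_red, y_red)

print(f"training instances: {len(y_train)} -> {len(y_red)} after ENN")
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
print(f"tree nodes: {tree.tree_.node_count}")
```

For the other filters compared in the paper, imbalanced-learn also provides RepeatedEditedNearestNeighbours and AllKNN, which correspond to RENN and ALLKNN; DROP3 and DROP5 are not included in that library and would need a separate implementation.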
Pages: 438-459
Number of pages: 22