The optimal combination of feature selection and data discretization: An empirical study

Cited by: 52
Authors
Tsai, Chih-Fong [1 ]
Chen, Yu-Chi [1 ]
Affiliations
[1] Natl Cent Univ, Dept Informat Management, Taoyuan, Taiwan
Keywords
Data mining; Discretization; Feature selection; Machine learning; ALGORITHMS;
DOI
10.1016/j.ins.2019.07.091
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812 ;
Abstract
Feature selection and data discretization are two important data pre-processing steps in data mining: the former focuses on filtering out unrepresentative features, while the latter transforms continuous attributes into discrete ones. In the literature, these two problems have usually been studied individually. However, the combination of the two steps has not been fully explored, although both feature selection and discretization may be required for some real-world datasets. In this paper, the two possible orderings of feature selection and discretization are examined in terms of classification accuracy and computational time. Specifically, filter, wrapper, and embedded feature selection methods are employed, namely PCA, GA, and C4.5, respectively. For discretization, both supervised and unsupervised discretizers are used, specifically MDLP, ChiMerge, equal frequency binning, and equal width binning. The experimental results, based on 10 UCI datasets, show that for the SVM classifier, performing MDLP first and C4.5 second outperforms the other combinations: not only does this combination require less computational time, but it also provides the highest classification accuracy. For the decision tree classifier, performing C4.5 first and MDLP second is recommended. (C) 2019 Elsevier Inc. All rights reserved.
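The two unsupervised discretizers compared in the abstract, equal width and equal frequency binning, can be illustrated with a minimal sketch. This is a pure-Python toy implementation for clarity, not the authors' experimental code; the bin count and sample data are illustrative only.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against constant columns
    # Clamp the maximum value into the last bin instead of a phantom k-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each bin holds (roughly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

data = [1.0, 2.0, 2.5, 3.0, 10.0, 11.0]
print(equal_width_bins(data, 2))      # width 5.0 -> [0, 0, 0, 0, 1, 1]
print(equal_frequency_bins(data, 2))  # 3 per bin -> [0, 0, 0, 1, 1, 1]
```

The contrast between the two outputs shows why the choice of discretizer matters: equal width is driven by the value range (so outliers can crowd most points into one bin), whereas equal frequency is driven by rank order.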
Pages: 282-293
Page count: 12
References
41 total
[31] Quinlan, J. R. C4.5: Programs for Machine Learning. 2014.
[32] Ribeiro, M. X. Applied Computing 2008, Vols. 1-3, 2008: 953.
[33] Saeys, Yvan; Inza, Inaki; Larranaga, Pedro. A review of feature selection techniques in bioinformatics [J]. Bioinformatics, 2007, 23(19): 2507-2517.
[34] Santoni, Daniele; Weitschek, Emanuel; Felici, Giovanni. Optimal discretization and selection of features by association rates of joint distributions [J]. RAIRO-Operations Research, 2016, 50(2): 437-449.
[35] Sheela, L. J. International Journal of Computer and Electrical Engineering, 2009, 1: 179.
[36] Tian, D. Lecture Notes in Computer Science, 2011, Vol. 6499: 135. DOI: 10.1007/978-3-642-18302-7_9
[37] Tran, Binh; Xue, Bing; Zhang, Mengjie. A new representation in PSO for discretization-based feature selection [J]. IEEE Transactions on Cybernetics, 2018, 48(6): 1733-1746.
[38] Tsai, Chih-Fong; Eberle, William; Chu, Chi-Yuan. Genetic algorithms in feature and instance selection [J]. Knowledge-Based Systems, 2013, 39: 240-247.
[39] Wu, Xindong; Kumar, Vipin; Quinlan, J. Ross; Ghosh, Joydeep; Yang, Qiang; Motoda, Hiroshi; McLachlan, Geoffrey J.; Ng, Angus; Liu, Bing; Yu, Philip S.; Zhou, Zhi-Hua; Steinbach, Michael; Hand, David J.; Steinberg, Dan. Top 10 algorithms in data mining [J]. Knowledge and Information Systems, 2008, 14(1): 1-37.
[40] Yang, Y. Data Mining and Knowledge Discovery Handbook, 2nd ed., 2010: 101. DOI: 10.1007/978-0-387-09823-4_6