Effects of dataset characteristics on the performance of feature selection techniques

Cited by: 49
Authors
Oreski, Dijana [1]
Oreski, Stjepan [2]
Klicek, Bozidar [1]
Affiliations
[1] Univ Zagreb, Fac Org & Informat, Pavlinska 2, Varazhdin 42000, Croatia
[2] Bank Karlovac, I G Kovac 1, Karlovac 47000, Croatia
Funding
EU Horizon 2020;
Keywords
Dataset characteristics; Feature selection; Comparative analysis; Data sparsity; Feature noise; Classification algorithms;
DOI
10.1016/j.asoc.2016.12.023
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While extensive research in data mining has been devoted to developing better feature selection techniques, none of this research has examined the intrinsic relationship between dataset characteristics and a feature selection technique's performance. Thus, our research examines experimentally how dataset characteristics affect both the accuracy and the time complexity of feature selection. To evaluate the performance of various feature selection techniques on datasets of different characteristics, extensive experiments with five feature selection techniques, three types of classification algorithms, seven types of dataset characterization methods and all possible combinations of dataset characteristics are conducted on 128 publicly available datasets. We apply the decision tree method to evaluate the interdependencies between dataset characteristics and performance. The results of the study reveal the intrinsic relationship between dataset characteristics and the performance of feature selection techniques. Additionally, our study contributes to research in data mining by providing a roadmap for future research on feature selection and a significantly wider framework for comparative analysis. © 2016 Elsevier B.V. All rights reserved.
Pages: 109-119
Page count: 11