Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts

Cited by: 68
Authors
Garcia-Osorio, Cesar [2 ]
de Haro-Garcia, Aida [1 ]
Garcia-Pedrajas, Nicolas [1 ]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
[2] Univ Burgos, Dept Civil Engn, Burgos, Spain
Keywords
Instance selection; Instance-based learning; Ensembles; Huge problems; REDUCTION; RULES;
DOI
10.1016/j.artint.2010.01.001
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced in many fields of research. Although current algorithms are useful for fairly large datasets, scaling problems are found when the number of instances is in the hundreds of thousands or millions. When we face huge problems, scalability becomes an issue, and most algorithms are not applicable. Thus, paradoxically, instance selection algorithms are for the most part impracticable for the same problems that would benefit most from their use. This paper presents a way of avoiding this difficulty using several rounds of instance selection on subsets of the original dataset. These rounds are combined using a voting scheme to allow good performance in terms of testing error and storage reduction, while the execution time of the process is significantly reduced. The method is particularly efficient when we use instance selection algorithms that are high in computational cost. The proposed approach shares the philosophy underlying the construction of ensembles of classifiers. In an ensemble, several weak learners are combined to form a strong classifier; in our method several weak (in the sense that they are applied to subsets of the data) instance selection algorithms are combined to produce a strong and fast instance selection method. An extensive comparison of 30 medium and large datasets from the UCI Machine Learning Repository using 3 different classifiers shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets (from three hundred thousand to more than a million instances) with good results and fast execution time. (C) 2010 Elsevier B.V. All rights reserved.
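The abstract describes the voting scheme only verbally, so the following Python snippet sketches one plausible reading of it: an arbitrary instance selection algorithm is applied to disjoint random subsets of the data over several rounds, each round votes against the instances it would discard, and instances collecting enough removal votes are dropped. The function and parameter names (democratic_instance_selection, base_selector, subset_size, vote_threshold) and the default majority threshold are assumptions for illustration, not the authors' implementation; the paper's own voting scheme tunes the required number of votes to balance testing error and storage reduction.

import numpy as np

def democratic_instance_selection(X, y, base_selector, rounds=10,
                                  subset_size=1000, vote_threshold=None,
                                  seed=None):
    """Hypothetical sketch of voting-based instance selection over subsets.

    base_selector(X_sub, y_sub) is any classical instance selection method;
    it must return the positions (within the subset) of the instances it
    would keep. Each round partitions the data into disjoint random subsets,
    runs the selector on every subset, and casts one removal vote against
    each instance the selector discards. Instances whose removal votes reach
    the threshold are excluded from the final selection.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros(n, dtype=int)              # removal votes per instance

    for _ in range(rounds):
        perm = rng.permutation(n)               # random disjoint partition
        for start in range(0, n, subset_size):
            idx = perm[start:start + subset_size]
            kept_local = np.asarray(base_selector(X[idx], y[idx]), dtype=int)
            removed_local = np.setdiff1d(np.arange(len(idx)), kept_local)
            votes[idx[removed_local]] += 1      # vote against discarded instances

    # Simple majority threshold here as a placeholder; the paper chooses the
    # threshold to balance testing error and storage reduction.
    threshold = vote_threshold if vote_threshold is not None else rounds // 2 + 1
    return np.flatnonzero(votes < threshold)    # indices of selected instances

Because each round touches every instance once and the base selector only ever sees fixed-size subsets, the cost per round grows linearly with the number of instances, which corresponds to the linear complexity referred to in the title.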
Pages: 410-441
Page count: 32