Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts

Cited by: 68
Authors
Garcia-Osorio, Cesar [2 ]
de Haro-Garcia, Aida [1 ]
Garcia-Pedrajas, Nicolas [1 ]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
[2] Univ Burgos, Dept Civil Engn, Burgos, Spain
Keywords
Instance selection; Instance-based learning; Ensembles; Huge problems; Reduction; Rules
DOI
10.1016/j.artint.2010.01.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced in many fields of research. Although current algorithms are useful for fairly large datasets, scaling problems arise when the number of instances reaches the hundreds of thousands or millions. When we face huge problems, scalability becomes an issue, and most algorithms are not applicable. Thus, paradoxically, instance selection algorithms are for the most part impractical for the very problems that would benefit most from their use. This paper presents a way of avoiding this difficulty by using several rounds of instance selection on subsets of the original dataset. These rounds are combined using a voting scheme to achieve good performance in terms of testing error and storage reduction, while the execution time of the process is significantly reduced. The method is particularly efficient when the underlying instance selection algorithms have a high computational cost. The proposed approach shares the philosophy underlying the construction of ensembles of classifiers. In an ensemble, several weak learners are combined to form a strong classifier; in our method, several weak (in the sense that they are applied to subsets of the data) instance selection algorithms are combined to produce a strong and fast instance selection method. An extensive comparison on 30 medium and large datasets from the UCI Machine Learning Repository using 3 different classifiers shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets (from three hundred thousand to more than a million instances) with good results and fast execution time. (C) 2010 Elsevier B.V. All rights reserved.
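The abstract describes the scheme at a high level: repeatedly partition the data into subsets, run a base instance selection algorithm on each subset, accumulate "remove this instance" votes across rounds, and discard instances whose vote count passes a threshold. The paper's exact partitioning, base selector, and threshold rule are not given in this record, so the following Python sketch is only illustrative: the function names, the toy redundancy-based base selector, and the simple-majority threshold are all assumptions, not the authors' method.

```python
import random

def democratic_selection(X, y, select_fn, rounds=10, subset_size=50,
                         threshold=None, seed=0):
    """Vote-based instance selection over random disjoint subsets.

    select_fn(Xs, ys) must return the *local* indices of the instances it
    would discard from the subset (Xs, ys). Instances whose accumulated
    removal votes reach `threshold` are dropped from the final selection.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * n
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)                      # re-partition each round
        for start in range(0, n, subset_size):
            part = order[start:start + subset_size]
            Xs = [X[i] for i in part]
            ys = [y[i] for i in part]
            for local in select_fn(Xs, ys):     # accumulate removal votes
                votes[part[local]] += 1
    if threshold is None:
        threshold = rounds // 2 + 1             # simple-majority stand-in
    return [i for i in range(n) if votes[i] < threshold]

def drop_redundant(Xs, ys):
    """Toy base selector (illustrative only): mark a point for removal when
    its nearest neighbour in the subset already carries the same label."""
    marked = []
    for i, (xi, yi) in enumerate(zip(Xs, ys)):
        dists = [(abs(xi - xj), j) for j, xj in enumerate(Xs) if j != i]
        if dists and ys[min(dists)[1]] == yi:
            marked.append(i)
    return marked

# Two well-separated 1-D clusters as a minimal smoke test.
X = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
y = [0, 0, 0, 0, 1, 1, 1, 1]
kept = democratic_selection(X, y, drop_redundant, rounds=5, subset_size=4)
```

Because the partitions are re-drawn each round, an instance is only discarded when several independent subsets agree it is expendable, which is the "democratic" element the title refers to; each round costs only the base algorithm's time on small subsets, giving the linear overall complexity claimed in the title.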
Pages: 410-441
Page count: 32