MR-DIS: democratic instance selection for big data by MapReduce

Cited by: 23
Authors
Arnaiz-González Á. [1]
González-Rogel A. [1]
Díez-Pastor J.-F. [1]
López-Nozal C. [1]
Affiliations
[1] University of Burgos, Avda. Cantabria s/n, Burgos, 09006, Burgos
Keywords
Apache Spark; Big data; Classification; Democratic Instance Selection; Instance selection; MapReduce
DOI
10.1007/s13748-017-0117-5
Abstract
Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. These methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the instance selection algorithm Democratic Instance Selection (DIS) is presented. The main advantages of the DIS algorithm are its computational complexity, which is linear in the number of instances, and its internal structure, which is naturally parallelizable. The purpose of this paper is threefold: first, the design of the DIS algorithm following the MapReduce model; second, its implementation in the popular big data framework Spark; and finally, its empirical evaluation on large-scale data sets. The results show that the processing time decreases linearly as the number of Spark executors increases, which makes the approach suitable for big data applications. In addition, the algorithm is publicly available to the scientific community. © 2017, Springer-Verlag Berlin Heidelberg.
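The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of vote-based instance selection expressed as map and reduce steps. It is not the authors' implementation. Several pieces are assumptions made for illustration: the partitions are drawn at random, the base selector (`map_partition`, a 1-nearest-neighbour editing rule) is hypothetical, the removal `threshold` is fixed by hand, and plain Python stands in for Spark.

```python
import random
from collections import Counter


def nearest_label(i, part, X, y):
    """Label of the instance in `part` (other than i) closest to instance i."""
    best = min((j for j in part if j != i),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    return y[best]


def map_partition(part, X, y):
    """Map step: emit (index, 1) for each instance the base selector would drop.

    Hypothetical base selector (an assumption, not taken from the paper):
    drop an instance whose nearest neighbour inside the same partition
    carries a different label.
    """
    if len(part) < 2:
        return []
    return [(i, 1) for i in part if nearest_label(i, part, X, y) != y[i]]


def democratic_selection(X, y, rounds=5, n_parts=2, threshold=3, seed=0):
    """Return the sorted indices of instances kept after `rounds` of voting."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    votes = Counter()
    for _ in range(rounds):
        # Assumption: partitions are formed by a random shuffle each round.
        rng.shuffle(indices)
        parts = [indices[k::n_parts] for k in range(n_parts)]
        for part in parts:                      # map over disjoint partitions
            for i, v in map_partition(part, X, y):
                votes[i] += v                   # reduce: sum votes per instance
    # Keep instances that collected fewer removal votes than the threshold.
    return [i for i in range(len(X)) if votes[i] < threshold]
```

Because each partition is processed independently, the map step is the part that a framework such as Spark could distribute across executors; only the per-instance vote totals need to be combined afterwards.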
Pages: 211-219
Number of pages: 8