Instance selection of linear complexity for big data

被引:50
作者
Arnaiz-Gonzalez, Alvar [1 ]
Diez-Pastor, Jose-Francisco [1 ]
Rodriguez, Juan J. [1 ]
Garcia-Osorio, Cesar [1 ]
机构
[1] Univ Burgos, Burgos, Spain
关键词
Nearest neighbor; Data reduction; Instance selection; Hashing; Big data; APPROXIMATE NEAREST-NEIGHBOR; PROTOTYPE SELECTION; REDUCTION TECHNIQUES; MUTUAL INFORMATION; ALGORITHMS; TOOL;
D O I
10.1016/j.knosys.2016.05.056
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n(2)), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances). (C) 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license.
引用
收藏
页码:83 / 95
页数:13
相关论文
共 63 条
  • [1] KEEL: a software tool to assess evolutionary algorithms for data mining problems
    Alcala-Fdez, J.
    Sanchez, L.
    Garcia, S.
    del Jesus, M. J.
    Ventura, S.
    Garrell, J. M.
    Otero, J.
    Romero, C.
    Bacardit, J.
    Rivas, V. M.
    Fernandez, J. C.
    Herrera, F.
    [J]. SOFT COMPUTING, 2009, 13 (03) : 307 - 318
  • [2] Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
    Andoni, Alexandr
    Indyk, Piotr
    [J]. COMMUNICATIONS OF THE ACM, 2008, 51 (01) : 117 - 122
  • [3] [Anonymous], 2010, MEMET COMPUT, DOI DOI 10.1007/S12293-010-0048-1
  • [4] [Anonymous], LOGIC J IGPL
  • [5] [Anonymous], 2000, Pattern Classification, DOI DOI 10.1007/978-3-319-57027-3_4
  • [6] Instance selection for regression by discretization
    Arnaiz-Gonzalez, Alvar
    Diez-Pastor, Jose F.
    Rodriguez, Juan J.
    Ignacio Garcia-Osorio, Cesar
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2016, 54 : 340 - 350
  • [7] A new fast prototype selection method based on clustering
    Arturo Olvera-Lopez, J.
    Ariel Carrasco-Ochoa, J.
    Francisco Martinez-Trinidad, J.
    [J]. PATTERN ANALYSIS AND APPLICATIONS, 2010, 13 (02) : 131 - 141
  • [8] An optimal algorithm for approximate nearest neighbor searching in fixed dimensions
    Arya, S
    Mount, DM
    Netanyahu, NS
    Silverman, R
    Wu, AY
    [J]. JOURNAL OF THE ACM, 1998, 45 (06) : 891 - 923
  • [9] THE GRAND TOUR - A TOOL FOR VIEWING MULTIDIMENSIONAL DATA
    ASIMOV, D
    [J]. SIAM JOURNAL ON SCIENTIFIC AND STATISTICAL COMPUTING, 1985, 6 (01): : 128 - 143
  • [10] Decision boundary preserving prototype selection for nearest neighbor classification
    Barandela, R
    Ferri, FJ
    Sánchez, JS
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2005, 19 (06) : 787 - 806