A First Attempt on Global Evolutionary Undersampling for Imbalanced Big Data

Cited by: 0
Authors
Triguero, I. [1]
Galar, M. [3]
Bustince, H. [3]
Herrera, F. [2]
Affiliations
[1] Univ Nottingham, Sch Comp Sci, Nottingham, England
[2] Univ Granada, Dept Comp Sci & Artificial Intelligence, CITIC UGR, E-18071 Granada, Spain
[3] Univ Publ Navarra, Dept Automat & Computat, Campus Arrosadia S-N, Pamplona 31006, Spain
Source
2017 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC) | 2017
Keywords
MAPREDUCE; CLASSIFICATION; INSIGHT;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Subject classification codes
081203 ; 0835 ;
Abstract
The design of efficient big data learning models has become a common need in a great number of applications. The massive amounts of available data may hinder the use of traditional data mining techniques, especially when evolutionary algorithms are involved as a key step. Existing solutions typically follow a divide-and-conquer approach in which the data is split into several chunks that are addressed individually. Next, the partial knowledge acquired from every slice of data is aggregated in multiple ways to solve the entire problem. However, these approaches lack a global view of the data as a whole, which may result in less accurate models. In this work we make a first attempt at designing a global evolutionary undersampling model for imbalanced classification problems. Such problems are characterised by a highly skewed class distribution, and evolutionary models are used to balance the dataset by selecting only the most relevant data. Using Apache Spark as the big data technology, we introduce a number of variations to the well-known CHC algorithm so that it can work with very large chromosomes and reduce the costs associated with fitness evaluation. We discuss some preliminary results, showing the great potential of this new kind of evolutionary big data model.
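To make the idea of evolutionary undersampling concrete, here is a minimal sketch of how one candidate solution might be scored. It is not the authors' Spark implementation. It assumes a binary chromosome with one gene per majority-class instance, a 1-nearest-neighbour classifier evaluated with leave-one-out, and the geometric mean of the two per-class accuracies as fitness. The function name `g_mean_fitness` and the toy data are hypothetical, and the sketch omits any extra term that rewards class balance.

```python
import math


def g_mean_fitness(mask, majority, minority):
    """Score one chromosome (hypothetical helper, not from the paper).

    mask[i] == 1 keeps majority[i] in the reduced training set; every
    minority instance is always kept. The score is the geometric mean of
    the minority and majority accuracies of a 1-NN classifier built on
    the reduced set and evaluated on all instances.
    """
    # Reduced training set: (point, label) with 1 = minority, 0 = majority.
    subset = [(x, 1) for x in minority]
    subset += [(x, 0) for x, gene in zip(majority, mask) if gene]

    def predict(query):
        best_label, best_dist = None, math.inf
        for point, label in subset:
            if point is query:
                continue  # leave-one-out: never match an instance to itself
            dist = sum((a - b) ** 2 for a, b in zip(point, query))
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label

    tpr = sum(predict(x) == 1 for x in minority) / len(minority)
    tnr = sum(predict(x) == 0 for x in majority) / len(majority)
    return math.sqrt(tpr * tnr)


# Toy data (illustrative only): four majority points, two minority points.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.2)]
minority = [(1.0, 1.0), (1.1, 1.0)]

keep_two = g_mean_fitness([1, 0, 0, 1], majority, minority)
keep_none = g_mean_fitness([0, 0, 0, 0], majority, minority)
```

An evolutionary search such as CHC would evolve a population of such masks and keep the fittest. Because the chromosome length equals the number of majority instances, both the chromosome and each fitness evaluation become very large on big data, which is the cost the abstract refers to.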
Pages: 2054-2061
Number of pages: 8