Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

Cited by: 70
Authors
Rendon, Erendira [1 ]
Alejo, Roberto [1 ]
Castorena, Carlos [1 ]
Isidro-Ortega, Frank J. [1 ]
Granda-Gutierrez, Everardo E. [2 ]
Affiliations
[1] Natl Inst Technol Mexico, Div Postgrad Studies & Res, IT Toluca, Av Tecnol S-N, It Toluca 52149, Mexico
[2] Autonomous Univ State Mexico, UAEM Univ Ctr Atlacomulco, Carretera Toluca Atlacomulco Km 60, Atlacomulco 50450, Mexico
Source
APPLIED SCIENCES-BASEL | 2020, Vol. 10, No. 04
Keywords
big data; multi-class imbalance problem; sampling methods; hyper-spectral remote sensing images; NEURAL-NETWORKS; CLASSIFICATION; INSIGHT; SMOTE;
DOI
10.3390/app10041276
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been devoted to dealing with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), which has been combined with cleaning techniques such as Edited Nearest Neighbor (ENN) or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mainly been addressed by adapting traditional techniques, while intelligent approaches have been relatively ignored. Therefore, this work analyzes the capabilities and possibilities of heuristic sampling methods applied with deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study is conducted on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. On these datasets, the effectiveness of a hybrid approach is analyzed: the dataset is first processed with SMOTE, an Artificial Neural Network (ANN) is trained on the resulting data, the ANN's output is then cleaned with ENN to remove noisy samples, and the ANN is finally retrained on the cleaned dataset. The results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier's nature must be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
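The hybrid pipeline summarized above (SMOTE oversampling, ANN training, ENN cleaning applied to the network's output, retraining) can be sketched with off-the-shelf components. The following Python sketch uses scikit-learn and imbalanced-learn; the network architecture, the ENN neighborhood size, and the use of class-probability vectors as the "ANN output" space are illustrative assumptions, not the authors' reference implementation for big data, hyper-spectral datasets.

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.neural_network import MLPClassifier

def smote_ann_enn(X, y, random_state=0):
    """Hypothetical sketch: SMOTE -> train ANN -> ENN on ANN output -> retrain."""
    # 1) Over-sample the minority classes with SMOTE.
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X, y)

    # 2) Train a first ANN (a simple MLP here) on the over-sampled data.
    ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=random_state)
    ann.fit(X_res, y_res)

    # 3) Apply ENN in the network's output space: samples whose class-probability
    #    vectors disagree with their neighbours are treated as noise and removed.
    probabilities = ann.predict_proba(X_res)
    enn = EditedNearestNeighbours(n_neighbors=3)
    enn.fit_resample(probabilities, y_res)
    keep = enn.sample_indices_          # indices of the samples ENN kept
    X_clean, y_clean = X_res[keep], y_res[keep]

    # 4) Retrain the ANN on the cleaned dataset and return it.
    ann_final = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=random_state)
    ann_final.fit(X_clean, y_clean)
    return ann_final

In the paper's setting, a deep network trained on a big data framework would take the place of the small MLP, but the control flow is the same; the key point is that ENN edits in the classifier's output space rather than in the input feature space.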
Pages: 15