Modelling a Stable Classifier for Handling Large Scale Data with Noise and Imbalance

Cited by: 0
Authors
Somasundaram, Akila [1]
Reddy, U. Srinivasulu [1]
Affiliations
[1] Natl Inst Technol, Dept Comp Applicat, Tiruchirappalli 620015, Tamil Nadu, India
Source
2017 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE IN DATA SCIENCE (ICCIDS) | 2017
Keywords
Data Imbalance; Noise; Borderline data; Classification; Performance Metrics; Ensemble Models; Boosting; Stacking; Big Data; RANKING
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Classifier performance is often impaired by anomalies such as noisy and borderline samples, and by the inherent imbalance in data. This is because classifier models are usually constructed on the assumption of ideal data conditions, which often does not hold. In reality, these anomalies occur at varying intensities, and in most cases they are an integral part of the problem domain. Classifier models must therefore be fine-tuned to accommodate such anomalies, which results in data-dependent models. This work analyses the effectiveness of various classifier models in handling noisy, borderline and imbalanced data. Doing so requires that the right set of metrics first be identified, as most of the usual metrics are not affected by such anomalies, even though the anomalies affect the reliability, robustness and practical efficacy of the classifiers. To ensure the scalability of the resulting models, the classifiers were implemented using Spark. A characterized examination of the results elucidates the effective prediction zones of each model, facilitating the identification of stable classifier models. It is found that a single model is inadequate in real-time scenarios, due to the complex interplay among the various anomalies. The work concludes by modelling a heterogeneous cost-based ensemble model for domain-based prediction.
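The abstract's point that common metrics can be unaffected by imbalance can be illustrated with a minimal sketch. The record does not say which metrics the authors selected, so minority-class recall and the geometric mean (G-mean) of recall and specificity are used here purely as assumptions. The data are synthetic, and all function names are hypothetical.

```python
from math import sqrt

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def minority_recall(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def g_mean(y_true, y_pred):
    """Geometric mean of recall on the minority and majority classes."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(recall * specificity)

# Synthetic, illustrative data: 990 majority samples (0), 10 minority samples (1).
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

print("accuracy:", accuracy(y_true, y_pred))
print("minority recall:", minority_recall(y_true, y_pred))
print("G-mean:", g_mean(y_true, y_pred))
```

In this sketch the always-majority classifier scores high accuracy while its minority recall and G-mean collapse to zero, which is the kind of gap an imbalance-aware metric is meant to expose.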
Pages: 6