Handling data irregularities in classification: Foundations, trends, and future challenges

Cited by: 161
Authors
Das, Swagatam [1]
Datta, Shounak [1]
Chaudhuri, Bidyut B. [2]
Affiliations
[1] Indian Stat Inst, Elect & Commun Sci Unit, 203 BT Rd, Kolkata 700108, India
[2] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, 203 BT Rd, Kolkata 700108, India
Keywords
Data irregularities; Class imbalance; Small disjuncts; Class-distribution skew; Missing features; Absent features; IMBALANCED DATA-SETS; LIKELIHOOD-BASED INFERENCE; EXTREME LEARNING-MACHINE; MISSING VALUES; SAMPLING METHOD; SMALL DISJUNCTS; NEURAL-NETWORK; IMPUTATION; REGRESSION; ENSEMBLES;
DOI
10.1016/j.patcog.2018.03.008
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most of the traditional pattern classifiers assume their input data to be well-behaved in terms of similar underlying class distributions, balanced size of classes, the presence of a full set of observed features in all data instances, etc. Practical datasets, however, show up with various forms of irregularities that are, very often, sufficient to confuse a classifier, thus degrading its ability to learn from the data. In this article, we provide a bird's eye view of such data irregularities, beginning with a taxonomy and characterization of various distribution-based and feature-based irregularities. Subsequently, we discuss the notable and recent approaches that have been taken to make the existing stand-alone as well as ensemble classifiers robust against such irregularities. We also discuss the interrelation and co-occurrences of the data irregularities including class imbalance, small disjuncts, class skew, missing features, and absent (non-existing or undefined) features. Finally, we uncover a number of interesting future research avenues that are equally contextual with respect to the regular as well as deep machine learning paradigms. (C) 2018 Elsevier Ltd. All rights reserved.
Pages: 674-693
Number of pages: 20