robROSE: A robust approach for dealing with imbalanced data in fraud detection

Cited by: 12
Authors
Baesens, Bart [1]
Hoeppner, Sebastiaan [2]
Ortner, Irene [3]
Verdonck, Tim [4]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Naamsestr 69, B-3000 Leuven, Belgium
[2] Katholieke Univ Leuven, Dept Math, Celestijnenlaan 200B, B-3001 Leuven, Belgium
[3] Appl Stat GmbH, Taubstummengasse 4-10, A-1040 Vienna, Austria
[4] Univ Antwerp, Dept Math, Middelheimlaan 1, B-2020 Antwerp, Belgium
Keywords
Fraud analysis; Skewed data; Outliers; Oversampling; Binary classification
DOI
10.1007/s10260-021-00573-7
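Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract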
A major challenge in fraud detection is that fraudulent activities form a minority class which makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than 0.5% of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that address the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to create synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging observations that deviate from it. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method enhances the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
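The two ideas the abstract combines, flagging atypical observations with a robust fit and oversampling the minority class with synthetic points, can be illustrated with a minimal Python sketch. This is not the robROSE algorithm or the authors' released code. It assumes a simple per-feature median/MAD rule for flagging outliers and a Gaussian smoothed bootstrap with a normal-reference rule-of-thumb bandwidth for generating samples. The function names `flag_outliers` and `smoothed_bootstrap`, the cutoff of 3, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def flag_outliers(X, cutoff=3.0):
    """Flag rows whose robust z-score (median/MAD) exceeds `cutoff` in any feature.

    Assumption: a simple per-feature rule, used here only for illustration.
    """
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against division by zero
    z = np.abs(X - med) / mad
    return (z > cutoff).any(axis=1)


def smoothed_bootstrap(X, n_new, seed=0):
    """Draw `n_new` synthetic rows: resample rows of X and add Gaussian noise.

    Assumption: a diagonal bandwidth from a normal-reference rule of thumb.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)
    idx = rng.integers(0, n, size=n_new)  # which observed rows to perturb
    return X[idx] + rng.normal(size=(n_new, d)) * h


# Toy minority class with one atypical observation (made-up data).
rng = np.random.default_rng(42)
fraud = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
fraud[0] = [25.0, -30.0]

is_outlier = flag_outliers(fraud)
clean = fraud[~is_outlier]                        # keep only the typical fraud cases
synthetic = smoothed_bootstrap(clean, n_new=200)  # oversample around them
```

Because the synthetic points are generated only around observations that pass the robust check, the atypical case does not seed new samples. Applying `smoothed_bootstrap` directly to `fraud` would instead spread synthetic points around the outlier as well.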
Pages: 841-861
Number of pages: 21