Robust weighted kernel logistic regression in imbalanced and rare events data

Cited by: 89
Authors
Maalouf, Maher [1]
Trafalis, Theodore B. [1]
Affiliations
[1] Univ Oklahoma, Sch Ind Engn, Norman, OK 73019 USA
Keywords
Classification; Endogenous sampling; Logistic regression; Kernel methods; Truncated Newton; Likelihood; Model
DOI
10.1016/j.csda.2010.06.014
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Recent developments in computing and technology, along with the availability of large amounts of raw data, have contributed to the creation of many effective techniques and algorithms in the fields of pattern recognition and machine learning. The main objectives for developing these algorithms include identifying patterns within the available data or making predictions, or both. Great success has been achieved with many classification techniques in real-life applications. With regard to binary data classification in particular, analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. This study examines rare events (REs) with binary dependent variables containing many more non-events (zeros) than events (ones). These variables are difficult to predict and to explain, as has been evidenced in the literature. This research combines rare events corrections to Logistic Regression (LR) with truncated Newton methods and applies these techniques to Kernel Logistic Regression (KLR). The resulting model, Rare Event Weighted Kernel Logistic Regression (RE-WKLR), is a combination of weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation, all of which are critical to enabling RE-WKLR to be an effective and powerful method for predicting rare events. Comparing RE-WKLR to SVM and TR-KLR, using non-linearly separable, small and large binary rare event datasets, we find that RE-WKLR is as fast as TR-KLR and much faster than SVM. In addition, according to the statistical significance test, RE-WKLR is more accurate than both SVM and TR-KLR. © 2010 Elsevier B.V. All rights reserved.
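The abstract names the ingredients of RE-WKLR without giving formulas, so the following is only a rough sketch of how weighting, regularization, and kernelization might fit together, not the authors' implementation. It assumes prior-correction weights of the form tau / ybar for events and (1 - tau) / (1 - ybar) for non-events (a common rare-events weighting, where tau is an assumed population event fraction and ybar the sample event fraction), an RBF kernel, and a dense Newton (IRLS) solve in place of the truncated Newton method the abstract mentions. The bias-correction step is omitted. All function names, the toy data, and the parameter values are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def rare_event_weights(y, tau):
    """Assumed weighting: tau/ybar for events, (1-tau)/(1-ybar) for non-events."""
    ybar = y.mean()
    return np.where(y == 1, tau / ybar, (1.0 - tau) / (1.0 - ybar))

def fit_weighted_klr(K, y, w, lam=1.0, n_iter=25):
    """Minimize the weighted logistic loss plus (lam/2) * alpha^T K alpha.

    Each pass is one Newton (IRLS) step with a dense linear solve; a
    truncated Newton method would solve the same system approximately,
    for example with a few conjugate-gradient iterations.
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        eta = np.clip(K @ alpha, -30.0, 30.0)   # linear predictor, clipped for stability
        p = 1.0 / (1.0 + np.exp(-eta))          # fitted event probabilities
        d = w * p * (1.0 - p)                   # weighted IRLS curvature terms
        A = d[:, None] * K + lam * np.eye(n)    # (D K + lam I)
        b = d * (K @ alpha) + w * (y - p)       # D K alpha + W (y - p)
        alpha = np.linalg.solve(A, b)
    return alpha

# Toy, non-linearly separable data with roughly 10% events (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = ((X ** 2).sum(1) > 4.6).astype(float)

w = rare_event_weights(y, tau=0.05)   # tau = 0.05 is an assumed population rate
K = rbf_kernel(X, X)
alpha = fit_weighted_klr(K, y, w)
p_hat = 1.0 / (1.0 + np.exp(-np.clip(K @ alpha, -30.0, 30.0)))
```

The dense solve costs on the order of n^3 operations per step, which is why an approximate solver becomes attractive on larger datasets.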
Pages: 168-183
Page count: 16