Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function

Cited by: 15
Authors
Zhang, Lili [1]
Geisler, Trent [1]
Ray, Herman [2]
Xie, Ying [3]
Affiliations
[1] Kennesaw State Univ, Analyt & Data Sci PhD Program, Kennesaw, GA 30144 USA
[2] Kennesaw State Univ, Analyt & Data Sci Inst, Kennesaw, GA 30144 USA
[3] Kennesaw State Univ, Dept Informat Technol, Kennesaw, GA 30144 USA
Keywords
Logistic regression; binary classification; imbalanced data; maximum likelihood; penalized log-likelihood function; cost-sensitive; CLASSIFICATION; BINARY;
DOI
10.1080/02664763.2021.1939662
Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code
020208; 070103; 0714
Abstract
Logistic regression is estimated by maximizing a log-likelihood objective function formulated under the assumption of maximizing overall accuracy, an assumption that does not hold for imbalanced data. The resulting models tend to be biased towards the majority class (i.e. non-event), which can cause substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions either require difficult hyperparameter estimation or incur high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with existing ones on statistics of the area under the receiver operating characteristic (ROC) curve from 10 public datasets and 16 simulated datasets, as well as on training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measures (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and the computational efficiency of logistic regression models improve when the proposed log-likelihood function is used as the learning objective.
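The general cost-sensitive strategy mentioned in the abstract (penalizing misclassification costs differently in the log-likelihood) can be illustrated with a minimal sketch. This is not the authors' proposed function, in which the penalty weights are decision variables learned from the data; here a single fixed weight `event_weight` is applied to the minority class as an ordinary hyperparameter. The function names, the synthetic data generator, and the use of plain gradient ascent are all assumptions made for illustration.

```python
import numpy as np

def weighted_log_likelihood(beta, X, y, event_weight):
    """Class-weighted log-likelihood; events (y == 1) are scaled by event_weight."""
    z = X @ beta
    log_p = -np.logaddexp(0.0, -z)    # log(sigmoid(z)), numerically stable
    log_1mp = -np.logaddexp(0.0, z)   # log(1 - sigmoid(z)), numerically stable
    return np.sum(event_weight * y * log_p + (1.0 - y) * log_1mp)

def fit_weighted_logistic(X, y, event_weight=1.0, lr=0.5, n_iter=3000):
    """Maximize the weighted log-likelihood with plain gradient ascent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # Gradient: sum_i [w * y_i * (1 - p_i) - (1 - y_i) * p_i] * x_i
        grad = X.T @ (event_weight * y * (1.0 - p) - (1.0 - y) * p)
        beta += lr * grad / len(y)
    return beta

def make_imbalanced_data(n=2000, seed=0):
    """Hypothetical synthetic data: one feature, events are the rare class."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    p_true = 1.0 / (1.0 + np.exp(-(-3.0 + 1.5 * x)))
    y = (rng.uniform(size=n) < p_true).astype(float)
    X = np.column_stack([np.ones(n), x])  # intercept column plus feature
    return X, y

if __name__ == "__main__":
    X, y = make_imbalanced_data()
    w = (1.0 - y).sum() / y.sum()  # non-event to event ratio as a simple weight
    beta_plain = fit_weighted_logistic(X, y, event_weight=1.0)
    beta_weighted = fit_weighted_logistic(X, y, event_weight=w)
    print("event rate:", y.mean())
    print("unweighted coefficients:", beta_plain)
    print("weighted coefficients:  ", beta_weighted)
```

With `event_weight=1.0` the objective reduces to the standard log-likelihood. Raising the weight mainly shifts the fitted intercept upward, so more observations are classified as events at a fixed probability threshold. Choosing that weight by hand is the kind of hyperparameter estimation the abstract describes as a drawback of existing solutions.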
Pages: 3257-3277
Page count: 21