Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function

Cited by: 15
Authors
Zhang, Lili [1]
Geisler, Trent [1]
Ray, Herman [2]
Xie, Ying [3]
Affiliations
[1] Kennesaw State Univ, Analyt & Data Sci PhD Program, Kennesaw, GA 30144 USA
[2] Kennesaw State Univ, Analyt & Data Sci Inst, Kennesaw, GA 30144 USA
[3] Kennesaw State Univ, Dept Informat Technol, Kennesaw, GA 30144 USA
Keywords
Logistic regression; binary classification; imbalanced data; maximum likelihood; penalized log-likelihood function; cost-sensitive; classification; binary
DOI
10.1080/02664763.2021.1939662
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract
Logistic regression is estimated by maximizing a log-likelihood objective function that is formulated under the assumption of maximizing overall accuracy. This assumption does not hold for imbalanced data: the resulting models tend to be biased towards the majority class (i.e. non-event), which can cause substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions either require difficult hyperparameter estimation or have high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with existing ones on statistics of the area under the receiver operating characteristic (ROC) curve across 10 public datasets and 16 simulated datasets, as well as on training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measurements (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and the computational efficiency of logistic regression models are improved when the proposed log-likelihood function is used as the learning objective.
Pages: 3257-3277
Number of pages: 21
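The cost-sensitive idea mentioned in the abstract, penalizing misclassification of events and non-events differently inside the log-likelihood, can be illustrated with a minimal sketch. This is not the penalized function the authors propose, whose exact form is not given in this record; it is the simpler variant with a single fixed weight on the minority-class terms. The name `event_weight`, the toy data, and the plain gradient-ascent fit are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_log_likelihood(beta, X, y, event_weight):
    # Sum of event_weight * y * log(p) + (1 - y) * log(1 - p).
    # event_weight = 1 recovers the ordinary log-likelihood.
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += event_weight * yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit(X, y, event_weight, lr=0.1, steps=5000):
    # Plain gradient ascent on the weighted log-likelihood (illustrative only).
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            # Per-observation gradient factor: w * y * (1 - p) - (1 - y) * p
            g = event_weight * yi * (1.0 - p) - (1 - yi) * p
            for j, v in enumerate(xi):
                grad[j] += g * v
        beta = [b + lr * gj / len(X) for b, gj in zip(beta, grad)]
    return beta


# Hypothetical toy data: 8 non-events, 2 events, columns = [intercept, x].
X = [[1.0, v] for v in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 2.5, 4.0)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

beta_plain = fit(X, y, event_weight=1.0)
beta_weighted = fit(X, y, event_weight=5.0)

for name, beta in (("plain", beta_plain), ("weighted", beta_weighted)):
    probs = [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in X[-2:]]
    print(name, [round(p, 3) for p in probs])
```

Raising `event_weight` pushes the fitted probabilities for the minority class upward, which is the bias correction the abstract refers to. Choosing that weight by hand is the kind of hyperparameter estimation the abstract describes as difficult; according to the abstract, the proposed method instead learns penalty weights from the data together with the coefficients.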