An integrated machine learning and DEA-predefined performance outcome prediction framework with high-dimensional imbalanced data

Cited by: 3
Authors
Shi, Yu [1, 3]
Zhao, Wei [2 ]
Affiliations
[1] Drake Univ, Coll Business & Publ Adm, Des Moines, IA USA
[2] Worcester Polytech Inst, Dept Biomed Engn, Worcester, MA USA
[3] Drake Univ, Coll Business & Publ Adm, Des Moines, IA 50311 USA
Keywords
Data envelopment analysis; machine learning; feature selection; performance evaluation; contextual variables; bank branch efficiency; credit risk; bankruptcy prediction; operating efficiency; financial ratios; SMOTE; outliers; model
DOI
10.1080/03155986.2023.2168943
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In performance evaluation, emerging studies utilize machine learning to increase the interpretability and robustness of data envelopment analysis (DEA), a non-parametric tool for assessing the relative performance of decision-making units (DMUs). In these studies, the machine learning dynamics typically do not replicate the DEA process in terms of directly labeling DMUs based on their relative performance. Practically, there is no standardized methodological framework that serves this purpose. We propose a data-driven and computationally efficient system that imitates DEA and predicts performance outcomes, which are grouped into several classes. First, a DEA composite index was constructed, and the subsequent DEA scores were labeled as the good, the acceptable, and the underperforming classes. Next, synthetic minority oversampling technique (SMOTE) with Manhattan distance metric was used to solve class imbalance in the labeled, high-dimensional dataset. The framework was built using different classifiers, including random forest, support vector machine, and logistic regression, to verify that the framework is not model-dependent. They achieved comparable recall rates (82.70%-95.39%). Moreover, the impacts of contextual variables on DMU performance were unveiled using model-based feature selection and logistic regression. The framework was tested on a banking dataset and an independent dataset containing the electronics, service, and retail industries.
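The labeling and oversampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `label_scores` and `smote_manhattan`, the cut-offs 0.5 and 0.8, and the neighbor count `k` are all hypothetical choices made for illustration, and the DEA scores are assumed to be already computed and scaled to [0, 1].

```python
import numpy as np


def label_scores(scores, lower=0.5, upper=0.8):
    """Map scores to 0 (underperforming), 1 (acceptable), 2 (good).

    The cut-offs `lower` and `upper` are hypothetical; the abstract does
    not state which thresholds were used.
    """
    scores = np.asarray(scores, dtype=float)
    # digitize returns 0 below `lower`, 1 in [lower, upper), 2 at or above `upper`
    return np.digitize(scores, [lower, upper])


def smote_manhattan(X_min, n_new, k=5, rng=None):
    """Generate `n_new` synthetic minority samples, SMOTE-style.

    Neighbors are ranked by Manhattan (L1) distance. Each synthetic point
    is a random interpolation between a minority sample and one of its
    k nearest minority neighbors.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    # Pairwise L1 distances between minority samples, shape (n, n)
    dist = np.abs(X_min[:, None, :] - X_min[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)            # seed sample per new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


if __name__ == "__main__":
    demo_scores = [0.95, 0.62, 0.31, 0.88, 0.47]     # made-up scores
    print(label_scores(demo_scores))

    minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(smote_manhattan(minority, n_new=3, k=2, rng=0))
```

The pairwise distance matrix keeps the sketch short but needs memory proportional to the squared number of minority samples times the feature count, so a library implementation would likely be preferable on a genuinely high-dimensional dataset. Oversampling should be applied to the training split only, so that synthetic points do not leak into the evaluation data.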
Pages: 100-129
Page count: 30