An integrated machine learning and DEA-predefined performance outcome prediction framework with high-dimensional imbalanced data

Cited by: 3
Authors
Shi, Yu [1, 3]
Zhao, Wei [2]
Affiliations
[1] Drake Univ, Coll Business & Publ Adm, Des Moines, IA USA
[2] Worcester Polytech Inst, Dept Biomed Engn, Worcester, MA USA
[3] Drake Univ, Coll Business & Publ Adm, Des Moines, IA 50311 USA
Keywords
Data envelopment analysis; machine learning; feature selection; performance evaluation; contextual variables; DATA ENVELOPMENT ANALYSIS; BANK BRANCH EFFICIENCY; CREDIT-RISK; BANKRUPTCY PREDICTION; OPERATING EFFICIENCY; FINANCIAL RATIOS; SMOTE; OUTLIERS; MODEL;
DOI
10.1080/03155986.2023.2168943
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In performance evaluation, emerging studies utilize machine learning to increase the interpretability and robustness of data envelopment analysis (DEA), a non-parametric tool for assessing the relative performance of decision-making units (DMUs). In these studies, the machine learning dynamics typically do not replicate the DEA process in terms of directly labeling DMUs based on their relative performance. Practically, there is no standardized methodological framework that serves this purpose. We propose a data-driven and computationally efficient system that imitates DEA and predicts performance outcomes, which are grouped into several classes. First, a DEA composite index was constructed, and the subsequent DEA scores were labeled as the good, the acceptable, and the underperforming classes. Next, synthetic minority oversampling technique (SMOTE) with Manhattan distance metric was used to solve class imbalance in the labeled, high-dimensional dataset. The framework was built using different classifiers, including random forest, support vector machine, and logistic regression, to verify that the framework is not model-dependent. They achieved comparable recall rates (82.70%-95.39%). Moreover, the impacts of contextual variables on DMU performance were unveiled using model-based feature selection and logistic regression. The framework was tested on a banking dataset and an independent dataset containing the electronics, service, and retail industries.
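The labeling and oversampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cut-off values `good_cut` and `acceptable_cut`, the function names, and the toy data are all assumptions made for illustration, and the DEA scores are taken as given rather than computed. The oversampler follows the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) with Manhattan distance used for the neighbour search.

```python
import random


def label_scores(scores, good_cut=0.9, acceptable_cut=0.6):
    """Map DEA scores in [0, 1] to three classes.

    The two cut-offs are hypothetical; the abstract does not state them.
    """
    labels = []
    for s in scores:
        if s >= good_cut:
            labels.append("good")
        elif s >= acceptable_cut:
            labels.append("acceptable")
        else:
            labels.append("underperforming")
    return labels


def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def smote_manhattan(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples, SMOTE-style.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours, where
    "nearest" is measured with Manhattan distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: manhattan(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example: hypothetical DEA scores and two-feature DMU descriptions.
scores = [0.95, 0.70, 0.65, 0.40, 0.75, 0.62, 0.35, 0.68]
labels = label_scores(scores)
features = [[1.0, 2.0], [2.0, 1.5], [2.2, 1.4], [5.0, 6.0],
            [2.1, 1.8], [1.9, 1.6], [5.5, 6.5], [2.3, 1.7]]
minority = [f for f, lab in zip(features, labels) if lab == "underperforming"]
new_points = smote_manhattan(minority, n_new=3, k=1)
```

In a full pipeline the balanced dataset would then be passed to classifiers such as those named in the abstract (random forest, support vector machine, logistic regression). Oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.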
Pages: 100-129
Page count: 30