Comparison of Cluster-Based Sampling Approaches for Imbalanced Data of Crashes Involving Large Trucks

Cited by: 3

Authors:

Tahfim, Syed As-Sadeq ^{[1]}

Chen, Yan ^{[1]}

Affiliations:

[1] Dalian Maritime Univ, Sch Maritime Econ & Management, Dalian 116026, Peoples R China

Source:

INFORMATION | 2024 / Vol. 15 / Issue 03

Keywords:

imbalanced crash data; cluster-based under-sampling; ADASYN; NearMiss-2; SMOTETomek; machine learning models; INJURY SEVERITY; TRAFFIC ACCIDENTS; CLASSIFICATION;

DOI:

10.3390/info15030145

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Severe and fatal crashes involving large trucks result in significant social and economic losses. Unfortunately, the notably low proportion of severe and fatal injury crashes involving large trucks creates an imbalance in crash data. Models trained on imbalanced crash data are likely to produce erroneous results. Therefore, there is a need to explore novel sampling approaches for imbalanced crash data, and it is crucial to determine the appropriate combination of machine learning model, sampling approach, and sampling ratio. This study introduces a novel cluster-based under-sampling technique that uses the k-prototypes clustering algorithm. After the initial cluster-based under-sampling, the consolidated under-sampled data set was further resampled using three different sampling approaches: adaptive synthetic sampling (ADASYN), NearMiss-2, and the synthetic minority oversampling technique combined with Tomek links (SMOTETomek). Four machine learning models (logistic regression (LR), random forest (RF), gradient-boosted decision trees (GBDT), and the multi-layer perceptron (MLP) neural network) were then trained and evaluated using the geometric mean (G-Mean) and area under the receiver operating characteristic curve (AUC) scores. The findings suggest that cluster-based under-sampling coupled with the investigated sampling approaches significantly improves the performance of machine learning models developed on crash data. In addition, the GBDT model combined with ADASYN or SMOTETomek is likely to yield better predictions than any model combined with NearMiss-2. Regarding sampling ratios, increasing the ratio with ADASYN and SMOTETomek is likely to improve model performance up to a certain level, whereas with NearMiss-2, performance is likely to drop significantly beyond a specific point. These findings provide valuable insights for selecting optimal strategies for treating the class imbalance issue in crash data.
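
The abstract names G-Mean and AUC as the evaluation scores. As a minimal illustration (not the authors' code), the sketch below computes both in plain Python for a binary case, assuming label 1 marks the minority class (for example, severe or fatal crashes). The function names `g_mean` and `auc` and the toy data are hypothetical.

```python
from math import sqrt


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for 0/1 labels.

    Assumption: 1 is the minority (positive) class, 0 the majority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the minority class
    specificity = tn / (tn + fp)  # recall on the majority class
    return sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 minority cases out of 10 records.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts the majority class is 80% accurate,
# yet its G-Mean is zero because it misses every minority case.
always_majority = [0] * 10
print(g_mean(y_true, always_majority))  # 0.0

# A model that finds one of the two minority cases, with two false alarms.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(round(g_mean(y_true, y_pred), 4))

# Predicted minority-class scores for the same ten records.
scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.35, 0.5, 0.05, 0.15, 0.25]
print(auc(y_true, scores))  # 0.9375
```

The first print shows why accuracy alone can mislead on imbalanced data: the always-majority predictor scores zero on G-Mean despite being right on 8 of 10 records.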


Pages: 18

50 items in total

[1] Cluster-based sampling approaches to imbalanced data distributions
Yen, Show-Jane
Lee, Yue-Shi
DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2006, 4081 : 427 - 436
[2] A Cluster-Based Approach for Analysis of Injury Severity in Interstate Crashes Involving Large Trucks
Tahfim, Syed As-Sadeq
Chen, Yan
SUSTAINABILITY, 2022, 14 (21)
[3] Cluster-based under-sampling approaches for imbalanced data distributions
Yen, Show-Jane
Lee, Yue-Shi
EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (03) : 5718 - 5727
[4] Cluster-based sampling of multiclass imbalanced data
Prachuabsupakij, Wanthanee
Soonthornphisaj, Nuanwan
INTELLIGENT DATA ANALYSIS, 2014, 18 (06) : 1109 - 1135
[5] A cluster-based hybrid sampling approach for imbalanced data classification
Feng, Shou
Zhao, Chunhui
Fu, Ping
REVIEW OF SCIENTIFIC INSTRUMENTS, 2020, 91 (05)
[6] A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data
Guzman-Ponce, A.
Valdovinos, R. M.
Sanchez, J. S.
HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2020, 2020, 12344 : 299 - 311
[7] Cluster-Based Minority Over-Sampling for Imbalanced Datasets
Puntumapon, Kamthorn
Rakthamamon, Thanawin
Waiyamai, Kitsana
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (12) : 3101 - 3109
[8] A Cluster-based Regrouping Approach for Imbalanced Data Distributions
Yu, Wen
Jiang, ShengYi
2012 WORLD AUTOMATION CONGRESS (WAC), 2012
[9] Cluster-Based Instance Selection for the Imbalanced Data Classification
Czarnowski, Ireneusz
Jedrzejowicz, Piotr
COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2018, PT II, 2018, 11056 : 191 - 200
[10] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
Amir Reza Salehi
Majid Khedmati
Scientific Reports, 14
