Comparison of Cluster-Based Sampling Approaches for Imbalanced Data of Crashes Involving Large Trucks

Cited by: 3
Authors
Tahfim, Syed As-Sadeq [1]
Chen, Yan [1]
Affiliations
[1] Dalian Maritime Univ, Sch Maritime Econ & Management, Dalian 116026, Peoples R China
Keywords
imbalanced crash data; cluster-based under-sampling; ADASYN; NearMiss-2; SMOTETomek; machine learning models; INJURY SEVERITY; TRAFFIC ACCIDENTS; CLASSIFICATION
DOI
10.3390/info15030145
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Severe and fatal crashes involving large trucks cause significant social and economic losses. However, severe and fatal injury crashes make up a notably low proportion of crashes involving large trucks, which leaves crash data imbalanced. Models trained on imbalanced crash data are likely to produce erroneous results. There is therefore a need to explore novel sampling approaches for imbalanced crash data, and it is crucial to determine the appropriate combination of machine learning model, sampling approach, and sampling ratio. This study introduces a novel cluster-based under-sampling technique that uses the k-prototypes clustering algorithm. After the initial cluster-based under-sampling, the consolidated under-sampled data set was further resampled using three sampling approaches: adaptive synthetic sampling (ADASYN), NearMiss-2, and the synthetic minority oversampling technique combined with Tomek links (SMOTETomek). Four machine learning models (logistic regression (LR), random forest (RF), gradient-boosted decision trees (GBDT), and a multi-layer perceptron (MLP) neural network) were then trained and evaluated using the geometric mean (G-Mean) and the area under the receiver operating characteristic curve (AUC). The findings suggest that cluster-based under-sampling coupled with the investigated sampling approaches significantly improves the performance of machine learning models developed on crash data. In addition, the GBDT model combined with ADASYN or SMOTETomek is likely to yield better predictions than any model combined with NearMiss-2. Regarding sampling ratios, increasing the ratio with ADASYN and SMOTETomek is likely to improve model performance up to a certain level, whereas with NearMiss-2, performance is likely to drop significantly beyond a specific point. These findings provide valuable insights for selecting strategies to treat the class imbalance issue in crash data.
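The two evaluation scores named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes binary labels (1 = severe or fatal crash, 0 = other), the standard definitions of G-Mean (square root of sensitivity times specificity) and AUC (probability that a random positive case is scored above a random negative case), and hypothetical toy data. The function names `g_mean`, `auc`, and `accuracy` are illustrative only.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (1 = minority class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """AUC via pairwise comparison; a tie between scores counts as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 95 non-severe crashes (0) and 5 severe crashes (1).
y_true = [0] * 95 + [1] * 5

# A model that always predicts the majority class.
always_majority = [0] * 100

# A model that flags 19 non-severe crashes wrongly and catches 4 of 5 severe.
balanced_model = [1] * 19 + [0] * 76 + [1] * 4 + [0] * 1
```

On this toy data, the always-majority model reaches an accuracy of 0.95 but a G-Mean of 0, because it never identifies a severe crash. This is the kind of misleading result that motivates imbalance-aware scores such as G-Mean and AUC.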
Pages: 18
Related papers
50 in total
  • [1] Cluster-based sampling approaches to imbalanced data distributions
    Yen, Show-Jane
    Lee, Yue-Shi
    DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2006, 4081 : 427 - 436
  • [2] A Cluster-Based Approach for Analysis of Injury Severity in Interstate Crashes Involving Large Trucks
    Tahfim, Syed As-Sadeq
    Chen, Yan
    SUSTAINABILITY, 2022, 14 (21)
  • [3] Cluster-based under-sampling approaches for imbalanced data distributions
    Yen, Show-Jane
    Lee, Yue-Shi
    EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (03) : 5718 - 5727
  • [4] Cluster-based sampling of multiclass imbalanced data
    Prachuabsupakij, Wanthanee
    Soonthornphisaj, Nuanwan
    INTELLIGENT DATA ANALYSIS, 2014, 18 (06) : 1109 - 1135
  • [5] A cluster-based hybrid sampling approach for imbalanced data classification
    Feng, Shou
    Zhao, Chunhui
    Fu, Ping
    REVIEW OF SCIENTIFIC INSTRUMENTS, 2020, 91 (05)
  • [6] A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data
    Guzman-Ponce, A.
    Valdovinos, R. M.
    Sanchez, J. S.
    HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2020, 2020, 12344 : 299 - 311
  • [7] Cluster-Based Minority Over-Sampling for Imbalanced Datasets
    Puntumapon, Kamthorn
    Rakthamamon, Thanawin
    Waiyamai, Kitsana
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (12) : 3101 - 3109
  • [8] A Cluster-based Regrouping Approach for Imbalanced Data Distributions
    Yu, Wen
    Jiang, ShengYi
    2012 WORLD AUTOMATION CONGRESS (WAC), 2012
  • [9] Cluster-Based Instance Selection for the Imbalanced Data Classification
    Czarnowski, Ireneusz
    Jedrzejowicz, Piotr
    COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2018, PT II, 2018, 11056 : 191 - 200
  • [10] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
    Amir Reza Salehi
    Majid Khedmati
    Scientific Reports, 14