Oversampling method via adaptive double weights and Gaussian kernel function for the transformation of unbalanced data in risk assessment of cardiovascular disease

Cited by: 12
Authors
Rao, Congjun [1]
Wei, Xi [1]
Xiao, Xinping [1]
Shi, Yu [1]
Goh, Mark [2, 3]
Affiliations
[1] Wuhan Univ Technol, Sch Sci, Wuhan 430070, Peoples R China
[2] Natl Univ Singapore, NUS Business Sch, Singapore 119623, Singapore
[3] Natl Univ Singapore, Logist Inst Asia Pacific, Singapore 119623, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Cardiovascular disease; Unbalanced data; Double weights; Gaussian kernel function; ADWGKFO; ENSEMBLE; SMOTE; PREDICTION; ALGORITHM;
DOI
10.1016/j.ins.2024.120410
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In risk assessment of cardiovascular disease (CVD), classification error caused by unbalanced data is a significant challenge that has attracted widespread attention and research in the data mining field. To address the imbalance of CVD data sets, an oversampling method via adaptive double weights and Gaussian kernel function (ADWGKFO) is proposed, which converts unbalanced data sets into balanced ones. First, a clustering algorithm is used to cluster the minority samples; boundary samples are identified by the Borderline-Synthetic Minority Over-sampling Technique (Borderline-SMOTE), K-nearest neighbor, and support vector machine algorithms; and the number of samples to synthesize in each cluster is calculated from the double weights of boundary points and majority distribution. Second, to define the classification boundary clearly, the mutual class potential of the new samples in each cluster is calculated with a Gaussian kernel function, and the new samples are filtered according to this potential until the data set is balanced. Finally, the proposed method is analyzed empirically on data sets from the Kaggle platform. To validate the efficacy and generality of the method, CatBoost, a recent ensemble algorithm, is selected to test the effect of ADWGKFO, and the combination is compared with different sampling methods and different classifiers using performance metrics such as accuracy, F1-score, and area under the curve (AUC). Compared with the other combinations, accuracy, F1-score, and AUC are significantly improved. It is concluded that the ADWGKFO method can improve data quality and increase the reliability of CVD risk assessment.
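The abstract names the two stages of the method but gives no formulas, so the following is only a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes that the double weight of a cluster is the product of a boundary weight and a majority-distribution weight, that the mutual class potential of a point is the sum of Gaussian kernels to majority samples minus the sum to minority samples, and that a candidate is kept when this potential does not exceed a threshold. The function names, the `gamma` and `threshold` parameters, and the interpolation step are all hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, gamma):
    """Gaussian kernel between two points; gamma is an assumed spread."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / gamma ** 2)


def mutual_class_potential(point, majority, minority, gamma=1.0):
    """Assumed definition: majority potential minus minority potential."""
    maj = sum(gaussian_kernel(point, m, gamma) for m in majority)
    mino = sum(gaussian_kernel(point, m, gamma) for m in minority)
    return maj - mino


def allocate(n_needed, boundary_w, distribution_w):
    """Split n_needed synthetic samples across clusters.

    Assumes the double weight is the product of the two positive weights.
    """
    combined = [b * d for b, d in zip(boundary_w, distribution_w)]
    total = sum(combined)
    counts = [int(n_needed * c / total) for c in combined]
    # Hand the rounding remainder to the heaviest clusters first.
    order = sorted(range(len(counts)), key=lambda i: combined[i], reverse=True)
    for i in order[: n_needed - sum(counts)]:
        counts[i] += 1
    return counts


def oversample_cluster(cluster, majority, minority, n_new,
                       threshold=0.0, gamma=1.0, rng=None, max_tries=10000):
    """Interpolate between cluster members (SMOTE-style) and keep only
    candidates whose mutual class potential is at or below threshold."""
    rng = rng or random.Random(0)
    kept, tries = [], 0
    while len(kept) < n_new and tries < max_tries:
        tries += 1
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        cand = tuple(x + t * (y - x) for x, y in zip(a, b))
        if mutual_class_potential(cand, majority, minority, gamma) <= threshold:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    majority = [(5.0, 5.0), (6.0, 5.0)]
    print(allocate(10, [2, 1, 1], [1, 1, 1]))
    print(oversample_cluster(minority, majority, minority, n_new=4))
```

Under these assumptions, a low or negative potential means a candidate lies in a region dominated by minority samples, so filtering on it keeps synthetic points away from the majority class. The `max_tries` cap is a design choice that stops the loop if few candidates pass the filter.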
Pages: 16