Oversampling method via adaptive double weights and Gaussian kernel function for the transformation of unbalanced data in risk assessment of cardiovascular disease
Cited by: 12
Authors:
Rao, Congjun [1]
Wei, Xi [1]
Xiao, Xinping [1]
Shi, Yu [1]
Goh, Mark [2,3]
Affiliations:
[1] Wuhan Univ Technol, Sch Sci, Wuhan 430070, Peoples R China
[2] Natl Univ Singapore, NUS Business Sch, Singapore 119623, Singapore
[3] Natl Univ Singapore, Logist Inst Asia Pacific, Singapore 119623, Singapore
Funding:
National Natural Science Foundation of China;
Keywords:
Cardiovascular disease;
Unbalanced data;
Double weights;
Gaussian kernel function;
ADWGKFO;
ENSEMBLE;
SMOTE;
PREDICTION;
ALGORITHM;
DOI:
10.1016/j.ins.2024.120410
Chinese Library Classification number:
TP [Automation technology, computer technology];
Subject classification code:
0812;
Abstract:
In risk assessment of cardiovascular disease (CVD), classification error caused by unbalanced data is a significant challenge that has attracted wide attention and research in the field of data mining. To address the imbalance of CVD data sets, an oversampling method via adaptive double weights and Gaussian kernel function (ADWGKFO) is proposed, which converts unbalanced data sets into balanced data sets. First, a clustering algorithm is used to cluster the minority samples; boundary samples are identified by the Borderline-Synthetic Minority Over-sampling Technique (Borderline-SMOTE), K-nearest neighbor, and support vector machine algorithms; and the number of samples synthesized in each group is calculated from the double weights of boundary points and majority distribution. Second, to define the classification boundary clearly, the mutual class potential of the new samples in each cluster is calculated with a Gaussian kernel function, and new samples are filtered according to the mutual class potential until the data set is balanced. Finally, the proposed method is analyzed empirically on data sets from the Kaggle platform. To validate the efficacy and generality of the proposed method, CatBoost, a new ensemble algorithm, is selected to test the effect of the ADWGKFO method, and the result is compared with different sampling methods and different classifiers using performance measures such as accuracy, F1-score, and area under the curve (AUC). Compared with the other combinations of methods, accuracy, F1-score, and AUC are significantly improved. It is concluded that the ADWGKFO method can improve data quality and increase the reliability of CVD risk assessment.
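The filtering step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the definition of mutual class potential used here (Gaussian-kernel potential of the majority class minus that of the minority class), the acceptance rule (keep a candidate only if that value is negative), the candidate generator (interpolation between two random minority samples, rather than the clustering and double-weight allocation the paper describes), and all names and the `gamma` parameter are assumptions made for illustration.

```python
import math
import random


def gaussian_potential(x, points, gamma=1.0):
    """Sum of Gaussian kernel values exp(-gamma * ||x - p||^2) over all points."""
    return sum(
        math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, p)))
        for p in points
    )


def mutual_class_potential(x, majority, minority, gamma=1.0):
    """Assumed definition: majority potential minus minority potential at x."""
    return gaussian_potential(x, majority, gamma) - gaussian_potential(x, minority, gamma)


def oversample_with_filter(minority, majority, n_new, gamma=1.0, seed=0, max_tries=10000):
    """Generate up to n_new synthetic minority samples.

    Candidates are linear interpolations between two random minority samples
    (a simplification of SMOTE-style synthesis). A candidate is kept only if
    its mutual class potential is negative, i.e. it lies in a region dominated
    by the minority class (assumed filtering rule).
    """
    rng = random.Random(seed)
    accepted = []
    tries = 0
    while len(accepted) < n_new and tries < max_tries:
        tries += 1
        a, b = rng.sample(minority, 2)
        t = rng.random()
        candidate = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
        if mutual_class_potential(candidate, majority, minority, gamma) < 0:
            accepted.append(candidate)
    return accepted


# Toy data (hypothetical): 4 minority samples, 6 majority samples.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
majority = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3), (4.9, 4.7), (5.0, 5.2)]

# Synthesize enough samples to balance the two classes.
new_points = oversample_with_filter(minority, majority, len(majority) - len(minority))
balanced_minority = minority + new_points
```

In this sketch a larger `gamma` makes each sample's influence more local, so the filter rejects fewer candidates that sit only moderately close to majority samples. The per-cluster sample counts derived from the double weights are not modeled here.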
Pages: 16