Machine learning in public health informatics: Evidence that complex sampling structures may not be needed for prediction models with imbalanced outcomes

Cited by: 0

Authors:

Si, Zhengye [1]

Li, Jinpu [1]

Leary, Emily [1]

Affiliations:

[1] Univ Missouri, Sch Med, Dept Orthopaed Surg, Thompson Lab Regenerat Orthopaed, 1100 Virginia Ave, Columbia, MO 65211 USA

Source:

ANNALS OF EPIDEMIOLOGY | 2025 / Volume 102

Keywords:

Public health informatics; Machine learning; National surveys; Data collection; Methods; REGULARIZATION PATHS; NATIONAL-HEALTH

DOI:

10.1016/j.annepidem.2024.12.016

Chinese Library Classification number:

R1 [Preventive medicine, hygiene]

Discipline classification codes:

1004; 120402

Abstract:

Purpose: To investigate the predictive ability of machine learning models for imbalanced outcomes from national survey data without the use of sampling weights. Methods: We evaluated the predictive performance of machine learning models on imbalanced outcomes from the US National Health and Nutrition Examination Survey (USNHANES) without using sampling weights. Four machine learning models (support vector machine, random forest, least absolute shrinkage and selection operator regression, and deep neural network) were compared with a logistic model that incorporates the survey's complex sampling design. Three resampling methods (oversampling, undersampling, and combined) were used to address class imbalance during model training. Results: For all models, the balanced accuracy was similar (ranging from 0.72 to 0.76), and specificity was lower than sensitivity except for random forest. The support vector machine and neural network had the highest sensitivity (ranging from 0.79 to 0.83), while the random forest had the highest specificity (ranging from 0.86 to 0.96), with one exception. PR-AUC scores and Brier scores were low, ranging from 0.2529 to 0.3313 (lower is worse) and from 0.1005 to 0.3245 (lower is better), respectively. Conclusions: For predicting osteoarthritis occurrence with USNHANES, the machine learning models had overall predictive capacity similar to that of the recommended methods that integrate the complex sampling design.
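The evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the toy labels, the 0.5 decision threshold, and every function name below are assumptions made for illustration only. It shows how sensitivity, specificity, balanced accuracy, and the Brier score are computed for an imbalanced binary outcome, and how simple random oversampling of the minority class might look. PR-AUC is omitted to keep the example dependency-free.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    return (sensitivity(y_true, y_pred) + specificity(y_true, y_pred)) / 2


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows (with replacement) until classes match.

    Hypothetical helper: one simple form of oversampling, applied to
    training data only.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy imbalanced data: 2 positives, 8 negatives (made-up values).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

print(sensitivity(y_true, y_pred))
print(specificity(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
print(brier_score(y_true, y_prob))

rows = [[p] for p in y_prob]
rows_bal, y_bal = random_oversample(rows, y_true)
print(len(y_bal), sum(y_bal))
```

Balanced accuracy weights both classes equally, which is why it is often preferred over plain accuracy when the outcome is rare. The Brier score assesses the predicted probabilities themselves rather than the thresholded labels.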


Pages: 75-80

Page count: 6
