Machine learning in public health informatics: Evidence that complex sampling structures may not be needed for prediction models with imbalanced outcomes

Times cited: 0
Authors
Si, Zhengye [1]
Li, Jinpu [1]
Leary, Emily [1]
Affiliations
[1] Univ Missouri, Sch Med, Dept Orthopaed Surg, Thompson Lab Regenerat Orthopaed, 1100 Virginia Ave, Columbia, MO 65211 USA
Keywords
Public health informatics; Machine learning; National surveys; Data collection; Methods; REGULARIZATION PATHS; NATIONAL-HEALTH
DOI
10.1016/j.annepidem.2024.12.016
Chinese Library Classification number
R1 [Preventive medicine, hygiene]
Discipline classification code
1004; 120402
Abstract
Purpose: This study investigates how well machine learning models predict imbalanced outcomes from national survey data when sampling weights are not used.
Methods: We evaluated the predictive performance of machine learning models on imbalanced outcomes from the US National Health and Nutrition Examination Survey (USNHANES) without using sampling weights. Four machine learning models (support vector machine, random forest, least absolute shrinkage and selection operator (LASSO) regression, and deep neural network) were compared with a logistic model that incorporates the survey's complex sampling design. Three resampling methods (oversampling, undersampling, and a combination of the two) were used to address class imbalance during model training.
Results: Balanced accuracy was similar across all models (0.72 to 0.76), and specificity was lower than sensitivity for every model except random forest. The support vector machine and neural network had the highest sensitivity (0.79 to 0.83), while the random forest had the highest specificity (0.86 to 0.96), with one exception. PR-AUC scores were low (0.2529 to 0.3313; lower is worse), and Brier scores ranged from 0.1005 to 0.3245 (lower is better).
Conclusions: For predicting osteoarthritis occurrence with USNHANES, the machine learning models had predictive capacity broadly similar to that of the recommended methods, which integrate the complex sampling design.
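The abstract names its evaluation metrics and resampling strategies without defining them. The following is a minimal sketch, not the authors' code, of random oversampling, balanced accuracy (the mean of sensitivity and specificity), and the Brier score (the mean squared difference between predicted probability and observed outcome). The function names, the 0.5 decision threshold, and the toy data are all assumptions made for illustration; the study's actual resampling procedure and software are not described in this record.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size.

    Assumes binary labels (0/1) with at least one row in each class.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy, made-up data: 2 positives out of 10 (an imbalanced outcome).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.1, 0.3, 0.2, 0.1, 0.7, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("Brier score:", brier_score(y_true, y_prob))

X = [[i] for i in range(len(y_true))]
X_bal, y_bal = random_oversample(X, y_true, seed=1)
print("class counts after oversampling:", y_bal.count(0), y_bal.count(1))
```

In a workflow like the one the abstract describes, resampling would be applied to the training data only, with the metrics computed on held-out data that keeps its original class balance.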
Pages: 75-80
Page count: 6