Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches

Cited by: 1
Authors
Ru, Boshu [1,3]
Kujawski, Stephanie [2 ]
Afanador, Nelson Lee [2 ]
Baumgartner, Richard [2 ]
Pawaskar, Manjiri [2 ]
Das, Amar [2 ]
Affiliations
[1] Merck & Co Inc, West Point, PA USA
[2] Merck & Co Inc, Rahway, NJ USA
[3] Merck & Co Inc, Sumneytown Pike 770, Main Stop WP37A, West Point, PA 19486 USA
Keywords
measles; measles outbreaks; measles epidemiology; machine learning; epidemiology; hybrid machine learning; infectious disease modeling; infectious disease outbreak prediction; unsupervised machine learning; supervised machine learning; infectious disease; model; predict; outbreak; VACCINATION COVERAGE; CHILDREN; HESITANCY; KINDERGARTEN; VACCINES
DOI
10.2196/42832
Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Abstract
Background: Measles, a highly contagious viral infection, is resurging in the United States, driven by international importation and declining domestic vaccination coverage. Despite this resurgence, measles outbreaks remain rare events that are difficult to predict. Better methods for predicting outbreaks at the county level would help allocate public health resources optimally.
Objective: We aimed to validate and compare extreme gradient boosting (XGBoost) and logistic regression, 2 supervised learning approaches, for predicting the US counties most likely to experience measles cases. We also aimed to assess the performance of hybrid versions of these models that incorporated additional predictors generated by 2 clustering algorithms: hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and unsupervised random forest (uRF).
Methods: We constructed a supervised machine learning model based on XGBoost and unsupervised models based on HDBSCAN and uRF. The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; these clustering data were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared with logistic regression models with and without input from the unsupervised models.
Results: Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and XGBoost hybrid models outperformed logistic regression and logistic regression hybrid models on the area under the receiver operating characteristic curve (0.920-0.926 vs 0.900-0.908), the area under the precision-recall curve (0.522-0.532 vs 0.485-0.513), and the F2 score (0.595-0.601 vs 0.385-0.426). Logistic regression and logistic regression hybrid models had higher sensitivity than XGBoost and XGBoost hybrid models (0.837-0.857 vs 0.704-0.735) but lower positive predictive value (0.122-0.141 vs 0.340-0.367) and lower specificity (0.793-0.821 vs 0.952-0.958). The hybrid versions of the logistic regression and XGBoost models had slightly higher areas under the precision-recall curve, specificity, and positive predictive values than the respective models that did not include any unsupervised features.
Conclusions: XGBoost provided more accurate predictions of measles cases at the county level than logistic regression. The prediction threshold in this model can be adjusted to align with each county's resources, priorities, and risk for measles. While clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal approach for integrating such data with supervised machine learning models requires further investigation.
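The abstract reports F2 scores alongside sensitivity and positive predictive value (PPV), and notes that the prediction threshold can be adjusted. The sketch below illustrates how these quantities relate, using the standard F-beta definition with beta = 2, which weights sensitivity more heavily than PPV. It is not the authors' code: the function names, the toy scores, and the toy labels are hypothetical, and pairing the lower ends of the reported logistic regression ranges (PPV 0.122, sensitivity 0.837) is an assumption made only to show that the formula gives a value close to the reported lower F2 bound of 0.385.

```python
def fbeta(precision, recall, beta=2.0):
    """Standard F-beta score; beta=2 gives the F2 score, favoring recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def metrics_at_threshold(scores, labels, threshold):
    """Classify score >= threshold as positive and summarize performance.

    scores: predicted probabilities (hypothetical values in this sketch)
    labels: 1 for a county with measles cases, 0 otherwise
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "sensitivity": sensitivity,
        "ppv": ppv,
        "specificity": specificity,
        "f2": fbeta(ppv, sensitivity),
    }


# Assumed pairing of the lower ends of the reported logistic regression ranges.
f2_logistic_low = fbeta(precision=0.122, recall=0.837)

# Hypothetical toy data: lowering the threshold trades PPV and specificity
# for sensitivity, which is the adjustment the conclusions describe.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 1, 0, 0, 0]
strict = metrics_at_threshold(toy_scores, toy_labels, threshold=0.5)
lenient = metrics_at_threshold(toy_scores, toy_labels, threshold=0.25)

if __name__ == "__main__":
    print(round(f2_logistic_low, 3))
    print(strict)
    print(lenient)
```

In the toy data, the lower threshold catches every positive county at the cost of more false alarms, so a county with limited resources might prefer the stricter setting while a high-risk county might prefer the lenient one.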
Pages: 11