Using electronic health records to identify candidates for human immunodeficiency virus pre-exposure prophylaxis: An application of super learning to risk prediction when the outcome is rare

Cited by: 19
Authors
Gruber, Susan [1]
Krakower, Douglas [2,3,4,5]
Menchaca, John T. [5]
Hsu, Katherine [6,7]
Hawrusik, Rebecca [6]
Maro, Judith C. [5]
Cocoros, Noelle M. [5]
Kruskal, Benjamin A. [8]
Wilson, Ira B. [9]
Mayer, Kenneth H. [2,3,4]
Klompas, Michael [5,10]
Affiliations
[1] Putnam Data Sci LLC, 85 Putnam Ave, Cambridge, MA 02139 USA
[2] Beth Israel Deaconess Med Ctr, Div Infect Dis, Boston, MA 02215 USA
[3] Fenway Hlth, Fenway Inst, Boston, MA USA
[4] Harvard Med Sch, Boston, MA 02115 USA
[5] Harvard Med Sch, Dept Populat Med, Boston, MA 02115 USA
[6] Massachusetts Dept Publ Hlth, Boston, MA USA
[7] Boston Med Ctr, Dept Pediat, Boston, MA USA
[8] Atrius Hlth, Boston, MA USA
[9] Brown Univ, Dept Hlth Serv Policy & Practice, Providence, RI 02912 USA
[10] Brigham & Womens Hosp, Div Infect Dis, 75 Francis St, Boston, MA 02115 USA
Keywords
EHR; machine learning; predictive modeling; PrEP; risk score prediction; super learner; HIV PREVENTION; MEN; SEX; VALIDATION; SCORE; AREA;
DOI
10.1002/sim.8591
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Human immunodeficiency virus (HIV) pre-exposure prophylaxis (PrEP) protects high-risk patients from becoming infected with HIV. Clinicians need help to identify candidates for PrEP based on information routinely collected in electronic health records (EHRs). The greatest statistical challenge in developing a risk prediction model is that HIV acquisition is extremely rare. Methods: Data consisted of 180 covariates (demographics, diagnoses, treatments, prescriptions) extracted from records on 399 385 patients (150 cases) seen at Atrius Health (2007-2015), a clinical network in Massachusetts. Super learner is an ensemble machine learning algorithm that uses k-fold cross-validation to evaluate and combine predictions from a collection of algorithms. We trained 42 variants of sophisticated algorithms, using different sampling schemes that more evenly balanced the ratio of cases to controls. We compared super learner's cross-validated area under the receiver operating characteristic curve (cv-AUC) with that of each individual algorithm. Results: The least absolute shrinkage and selection operator (lasso) using a 1:20 class ratio outperformed the super learner (cv-AUC = 0.86 vs 0.84). A traditional logistic regression model restricted to 23 clinician-selected main terms was slightly inferior (cv-AUC = 0.81). Conclusion: Machine learning was successful at developing a model to predict 1-year risk of acquiring HIV based on a physician-curated set of predictors extracted from EHRs.
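The evaluation idea in the abstract (rebalance cases and controls for training, then score each candidate algorithm by cv-AUC) can be illustrated with a minimal sketch. This is not the authors' implementation and it omits the super learner's step of combining candidate predictions; it only shows how one candidate could be scored. All function names, the toy `fit_mean_difference` learner, and the choice to downsample controls only in the training folds are assumptions made for illustration.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def downsample_controls(labels, indices, ratio, rng):
    """Keep every case and at most `ratio` controls per case (e.g. 1:20)."""
    cases = [i for i in indices if labels[i] == 1]
    controls = [i for i in indices if labels[i] == 0]
    keep = min(len(controls), ratio * len(cases))
    return cases + rng.sample(controls, keep)


def stratified_folds(labels, k, rng):
    """Split indices into k folds so the rare cases are spread across folds."""
    folds = [[] for _ in range(k)]
    for cls in (1, 0):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cv_auc(fit, X, y, k=5, ratio=20, seed=0):
    """Mean AUC over k folds; controls are downsampled in training folds only."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, rng)
    fold_aucs = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        train = downsample_controls(y, train, ratio, rng)
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(X[i]) for i in test]
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / k


def fit_mean_difference(X, y):
    """Toy stand-in learner (hypothetical): weight = case mean - control mean."""
    n_features = len(X[0])
    cases = [x for x, label in zip(X, y) if label == 1]
    controls = [x for x, label in zip(X, y) if label == 0]
    weights = [
        sum(x[j] for x in cases) / len(cases)
        - sum(x[j] for x in controls) / len(controls)
        for j in range(n_features)
    ]
    return lambda x: sum(w * v for w, v in zip(weights, x))
```

Calling `cv_auc(fit_mean_difference, X, y, k=5, ratio=20)` returns one cv-AUC value for that learner; repeating the call with other learners or other `ratio` values would give the kind of side-by-side comparison the abstract describes.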
Pages: 3059-3073
Page count: 15