Using Clinical Notes and Natural Language Processing for Automated HIV Risk Assessment

Cited by: 80
Authors
Feller, Daniel J. [1]
Zucker, Jason [2]
Yin, Michael T. [2]
Gordon, Peter [2]
Elhadad, Noemie [1]
Affiliations
[1] Columbia Univ, Dept Biomed Informat, 622 West 168th St, New York, NY 10032 USA
[2] Columbia Univ, Dept Med, Div Infect Dis, New York, NY 10032 USA
Funding
National Institutes of Health (USA)
Keywords
predictive analytics; social determinants of health; HIV; natural language processing; prevention; COST-EFFECTIVENESS; HEALTH; ANALYTICS; PROGRAM; DISEASE; MODEL
DOI
10.1097/QAI.0000000000001580
Chinese Library Classification (CLC) number
R392 [Medical Immunology]; Q939.91 [Immunology]
Discipline classification code
100102
Abstract
Objective: Universal HIV screening programs are costly, labor intensive, and often fail to identify high-risk individuals. Automated risk assessment methods that leverage longitudinal electronic health records (EHRs) could catalyze targeted screening programs. Although social and behavioral determinants of health are typically captured in narrative documentation, previous analyses have considered only structured EHR fields. We examined whether natural language processing (NLP) would improve predictive models of HIV diagnosis. Methods: The study cohort comprised 181 HIV+ individuals who received care at New York Presbyterian Hospital before a confirmatory HIV diagnosis and 543 HIV-negative controls selected using propensity score matching. EHR data, including demographics, laboratory tests, diagnosis codes, and unstructured notes recorded before HIV diagnosis, were extracted for modeling. Three predictive models were developed using machine-learning algorithms: (1) a baseline model with only structured EHR data, (2) baseline plus NLP topics, and (3) baseline plus NLP clinical keywords. Results: Predictive models demonstrated a range of performance, with F measures of 0.59 for the baseline model, 0.63 for the baseline + NLP topic model, and 0.74 for the baseline + NLP keyword model. The baseline + NLP keyword model yielded the highest precision by combining keywords such as "msm," "unprotected," "hiv," and "methamphetamine" with structured EHR data indicative of additional HIV risk factors. Conclusions: NLP improved the predictive performance of automated HIV risk assessment by extracting terms in clinical text indicative of high-risk behavior. Future studies should explore more advanced techniques for extracting social and behavioral determinants from clinical text.
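The "baseline plus NLP clinical keywords" idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the layout of the structured feature list, and the use of simple 0/1 keyword indicators are all assumptions made for illustration, and the only keywords used are the four quoted in the abstract. The `f_measure` helper assumes the reported "F measure" is the balanced F1 score, which the abstract does not state explicitly.

```python
import re

# Keywords quoted in the abstract; the study's full vocabulary is not listed here.
KEYWORDS = ("msm", "unprotected", "hiv", "methamphetamine")


def keyword_features(note_text, keywords=KEYWORDS):
    """Return one 0/1 indicator per keyword, matched case-insensitively on whole tokens."""
    tokens = set(re.findall(r"[a-z0-9]+", note_text.lower()))
    return [1 if kw in tokens else 0 for kw in keywords]


def build_feature_vector(structured, note_text):
    """Append keyword indicators to structured EHR features (hypothetical layout)."""
    return list(structured) + keyword_features(note_text)


def f_measure(precision, recall):
    """Balanced F measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A call such as `build_feature_vector([34, 1], "MSM, denies methamphetamine use")` appends four keyword indicators to the two hypothetical structured values. Note that this sketch does not handle negation ("denies"), which a real clinical NLP system would need to address.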
Pages: 160-166
Page count: 7