Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies

Cited by: 16
Authors
Afshar, Majid [1, 2]
Dligach, Dmitriy [1, 2, 3]
Sharma, Brihat [3]
Cai, Xiaoyuan [4]
Boyda, Jason [4]
Birch, Steven [4]
Valdez, Daniel [4]
Zelisko, Suzan [4]
Joyce, Cara [1, 2]
Modave, Francois [1, 2]
Price, Ron [1, 4]
Affiliations
[1] Loyola Univ Chicago, Ctr Hlth Outcomes & Informat Res, Hlth Sci Div, 2160 S First Ave,Bldg 115,Room 445, Maywood, IL 60156 USA
[2] Loyola Univ Chicago, Stritch Sch Med, Dept Publ Hlth Sci, Maywood, IL 60156 USA
[3] Loyola Univ, Dept Comp Sci, Chicago, IL 60611 USA
[4] Loyola Univ Chicago, Informat & Syst Dev Hlth Sci Div, Maywood, IL 60156 USA
Keywords
natural language processing; unstructured data; clinical text and knowledge extraction system; data architecture; unified medical language system; BIG DATA; SYSTEM; EXTRACTION; TEXT; NLP;
DOI
10.1093/jamia/ocz068
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Natural language processing (NLP) engines such as the clinical Text Analysis and Knowledge Extraction System (cTAKES) are a solution for processing notes for research, but optimizing their performance for a clinical data warehouse (CDW) remains a challenge. We aim to develop a high throughput NLP architecture using cTAKES and present a predictive model use case. Materials and Methods: The CDW comprised 1 103 038 patients across 10 years. The architecture was constructed using the Hadoop data repository for source data and 3 large-scale symmetric processing servers for NLP. Each named entity mention in a clinical document was mapped to a Unified Medical Language System concept unique identifier (CUI). Results: The NLP architecture processed 83 867 802 clinical documents in 13.33 days and produced 37 721 886 606 CUIs across 8 standardized medical vocabularies. Performance of the architecture exceeded 500 000 documents per hour across 30 parallel instances of cTAKES, including 10 instances dedicated to documents greater than 20 000 bytes. In a use-case example for predicting 30-day hospital readmission, a CUI-based model had similar discrimination to n-grams, with an area under the receiver operating characteristic curve of 0.75 (95% CI, 0.74-0.76). Discussion and Conclusion: Our health system's high throughput NLP architecture may serve as a benchmark for large-scale clinical research using a CUI-based approach.
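The size-based split described in the abstract (30 parallel instances, 10 of them reserved for documents greater than 20 000 bytes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Document, pool_for, and Router are hypothetical, the round-robin assignment within each pool is an assumption, and measuring size as UTF-8 bytes is an assumption. Only the counts and the threshold come from the abstract. Two small helpers also derive run-wide averages from the reported totals.

```python
from dataclasses import dataclass
from itertools import count

# Figures taken from the abstract.
LARGE_DOC_BYTES = 20_000      # documents greater than this go to the dedicated pool
TOTAL_INSTANCES = 30          # parallel NLP instances
LARGE_DOC_INSTANCES = 10      # instances dedicated to large documents
STANDARD_INSTANCES = TOTAL_INSTANCES - LARGE_DOC_INSTANCES


@dataclass(frozen=True)
class Document:
    """Hypothetical container for one clinical note."""
    doc_id: str
    text: str


def pool_for(doc: Document) -> str:
    """Pick a pool by document size (assumption: size is measured in UTF-8 bytes)."""
    size_bytes = len(doc.text.encode("utf-8"))
    return "large" if size_bytes > LARGE_DOC_BYTES else "standard"


class Router:
    """Assigns documents to instance indices 0..29.

    Assumption: round-robin within each pool. The abstract does not say
    how documents were distributed among instances.
    """

    def __init__(self) -> None:
        self._next = {"standard": count(), "large": count()}

    def assign(self, doc: Document) -> int:
        pool = pool_for(doc)
        if pool == "large":
            # Large-document instances occupy indices 20..29 in this sketch.
            return STANDARD_INSTANCES + next(self._next["large"]) % LARGE_DOC_INSTANCES
        # Standard instances occupy indices 0..19.
        return next(self._next["standard"]) % STANDARD_INSTANCES


def average_docs_per_hour(total_docs: int, days: float) -> float:
    """Run-wide mean throughput in documents per hour."""
    return total_docs / (days * 24)


def average_cuis_per_doc(total_cuis: int, total_docs: int) -> float:
    """Mean number of CUIs produced per document."""
    return total_cuis / total_docs


if __name__ == "__main__":
    router = Router()
    note = Document("example-1", "Patient denies chest pain.")
    print(pool_for(note), router.assign(note))
    print(round(average_docs_per_hour(83_867_802, 13.33)))
    print(round(average_cuis_per_doc(37_721_886_606, 83_867_802)))
```

With the abstract's totals, the run-wide mean is roughly 262 000 documents per hour and about 450 CUIs per document. The mean is lower than the reported figure of more than 500 000 documents per hour, which suggests that figure describes a peak or best sustained rate rather than the average over all 13.33 days; the abstract does not say which.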
Pages: 1364-1369
Number of pages: 6