ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Cited by: 0
Authors
Gan, Ziming [1]
Zhou, Doudou [2]
Rush, Everett [3]
Panickan, Vidul A. [4, 5]
Ho, Yuk-Lam [5]
Ostrouchov, George [3]
Xu, Zhiwei [6]
Shen, Shuting [7]
Xiong, Xin [8]
Greco, Kimberly F. [8]
Hong, Chuan [7]
Bonzel, Clara-Lea [4]
Wen, Jun [4]
Costa, Lauren [5]
Cai, Tianrun [5, 9]
Begoli, Edmon
Xia, Zongqi [10]
Gaziano, J. Michael [5, 9]
Liao, Katherine P. [5, 9]
Cho, Kelly [5, 9]
Cai, Tianxi [4, 5, 8]
Lu, Junwei [5, 8]
Affiliations
[1] Univ Chicago, Dept Stat, 5801 S Ellis Ave, Chicago, IL 60615 USA
[2] Natl Univ Singapore, Dept Stat & Data Sci, Singapore 117546, Singapore
[3] Oak Ridge Natl Lab, Bethel Valley Rd, Oak Ridge, TN 37830 USA
[4] Harvard Med Sch, 25 Shattuck St, Boston, MA 02115 USA
[5] VA Boston Healthcare Syst, 150 S Huntington Ave, Boston, MA 02130 USA
[6] Univ Michigan, Dept Stat, 500 S State St, Ann Arbor, MI 48109 USA
[7] Duke Univ, Dept Biostat & Bioinformat, 1121 West Main St, Durham, NC 27708 USA
[8] Harvard TH Chan Sch Publ Hlth, 677 Huntington Ave, Boston, MA 02115 USA
[9] Brigham & Womens Hosp, 60 Fenwood Rd, Boston, MA 02115 USA
[10] Univ Pittsburgh, Clin & Translat Sci, 3501 Fifth Ave, Pittsburgh, PA 15260 USA
Keywords
Electronic health records; Natural language processing; Representation learning; Knowledge graph; ALZHEIMER-DISEASE; IDENTIFY; MODERATE; RISK;
DOI
10.1016/j.jbi.2024.104761
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.
Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer's disease patients.
Results: ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms' performance. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates.
Conclusion: The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
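As an illustration of the first step described under Methods (deriving embedding vectors and pairwise similarities between clinical features), the following is a minimal sketch rather than the authors' implementation. It assumes the embeddings come from a factorization of a shifted positive pointwise mutual information (SPPMI) matrix built from feature co-occurrence counts, which the abstract does not state. The function names `sppmi`, `embed`, and `cosine_similarity` and the toy count matrix are hypothetical. The p-value computation and the sparse embedding regression are not covered.

```python
import numpy as np


def sppmi(cooc, shift=1.0):
    """Shifted positive PMI from a symmetric co-occurrence count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    # Zero counts give -inf or nan; treat them as "no association".
    pmi = np.nan_to_num(pmi, nan=0.0, posinf=0.0, neginf=0.0)
    return np.maximum(pmi - np.log(shift), 0.0)


def embed(cooc, dim):
    """Embed each feature using the top-`dim` eigenpairs of the SPPMI matrix."""
    m = sppmi(cooc)
    vals, vecs = np.linalg.eigh(m)           # m is symmetric
    order = np.argsort(vals)[::-1][:dim]     # largest eigenvalues first
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)


def cosine_similarity(emb):
    """Pairwise cosine similarity between the rows of `emb`."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


# Toy example (made-up counts): features 0 and 1 co-occur often,
# as do features 2 and 3, with little co-occurrence across the two groups.
toy_cooc = np.array([
    [50, 40, 2, 1],
    [40, 60, 1, 2],
    [2, 1, 55, 45],
    [1, 2, 45, 50],
])
sim = cosine_similarity(embed(toy_cooc, dim=2))
print(np.round(sim, 3))
```

In this toy setting, features within the same group receive a much higher cosine similarity than features from different groups, which is the kind of relatedness signal a knowledge graph edge would be based on.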
Pages: 11