ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Authors
Gan, Ziming [1]
Zhou, Doudou [2]
Rush, Everett [3]
Panickan, Vidul A. [4, 5]
Ho, Yuk-Lam [5]
Ostrouchov, George [3]
Xu, Zhiwei [6]
Shen, Shuting [7]
Xiong, Xin [8]
Greco, Kimberly F. [8]
Hong, Chuan [7]
Bonzel, Clara-Lea [4]
Wen, Jun [4]
Costa, Lauren [5]
Cai, Tianrun [5, 9]
Begoli, Edmon
Xia, Zongqi [10]
Gaziano, J. Michael [5, 9]
Liao, Katherine P. [5, 9]
Cho, Kelly [5, 9]
Cai, Tianxi [4, 5, 8]
Lu, Junwei [5, 8]
Affiliations
[1] Univ Chicago, Dept Stat, 5801 S Ellis Ave, Chicago, IL 60615 USA
[2] Natl Univ Singapore, Dept Stat & Data Sci, Singapore 117546, Singapore
[3] Oak Ridge Natl Lab, Bethel Valley Rd, Oak Ridge, TN 37830 USA
[4] Harvard Med Sch, 25 Shattuck St, Boston, MA 02115 USA
[5] VA Boston Healthcare Syst, 150 S Huntington Ave, Boston, MA 02130 USA
[6] Univ Michigan, Dept Stat, 500 S State St, Ann Arbor, MI 48109 USA
[7] Duke Univ, Dept Biostat & Bioinformat, 1121 West Main St, Durham, NC 27708 USA
[8] Harvard TH Chan Sch Publ Hlth, 677 Huntington Ave, Boston, MA 02115 USA
[9] Brigham & Womens Hosp, 60 Fenwood Rd, Boston, MA 02115 USA
[10] Univ Pittsburgh, Clin & Translat Sci, 3501 Fifth Ave, Pittsburgh, PA 15260 USA
Keywords
Electronic health records; Natural language processing; Representation learning; Knowledge graph; ALZHEIMER-DISEASE; IDENTIFY; MODERATE; RISK;
DOI
10.1016/j.jbi.2024.104761
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (hereafter NLP data). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.
Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms' performance. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates.
Conclusion: The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
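As an illustration of the evaluation described in the abstract (scoring concept pairs by embedding similarity and reporting an AUC against known relationships), the following is a minimal sketch only. It assumes cosine similarity as the score and a pairwise rank-based AUC; the concept names, the three-dimensional toy vectors, and the helper functions `cosine` and `auc` are all hypothetical and are not taken from the paper, which does not spell out its implementation here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auc(pos_scores, neg_scores):
    """Rank-based AUC: chance that a known pair outscores an unrelated pair.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy embeddings (assumption: real embeddings are much larger
# and are learned from EHR data, not written by hand).
embeddings = {
    "dx:alzheimers": [0.9, 0.1, 0.0],
    "rx:donepezil": [0.8, 0.2, 0.1],
    "dx:fracture": [0.0, 0.9, 0.3],
    "px:xray": [0.1, 0.8, 0.4],
}

# Hypothetical gold-standard related pairs used as positives.
known_related = {
    ("dx:alzheimers", "rx:donepezil"),
    ("dx:fracture", "px:xray"),
}

# Score every unordered concept pair by cosine similarity.
scores = {
    pair: cosine(embeddings[pair[0]], embeddings[pair[1]])
    for pair in combinations(embeddings, 2)
}

pos = [s for pair, s in scores.items() if pair in known_related]
neg = [s for pair, s in scores.items() if pair not in known_related]
auc_value = auc(pos, neg)
print(f"AUC = {auc_value:.3f}")
```

The pairwise definition of AUC is used here because it needs no external library and makes the interpretation explicit; on real data a sorted-rank implementation would scale better.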
Pages: 11