Leveraging multi-site electronic health data for characterization of subtypes: a pilot study of dementia in the N3C Clinical Tenant

Cited by: 1
Authors
Sharma, Suchetha [1]
Liu, Jiebei [2]
Abramowitz, Amy Caroline [3]
Geary, Carol Reynolds [4]
Johnston, Karen C. [5]
Manning, Carol [5]
Van Horn, John Darrell [1]
Zhou, Andrea [6]
Anzalone, Alfred J. [7]
Loomba, Johanna [6]
Pfaff, Emily [8]
Brown, Don [9]
Affiliations
[1] Univ Virginia, Sch Data Sci, Charlottesville, VA 22903 USA
[2] Univ Virginia, Dept Syst Engn, Charlottesville, VA 22904 USA
[3] Univ North Carolina Chapel Hill, Sch Med, Dept Psychiat, Chapel Hill, NC 27514 USA
[4] Univ Nebraska Med Ctr, Dept Pathol Microbiol & Immunol, Omaha, NE 68198 USA
[5] Univ Virginia, Dept Neurol, Charlottesville, VA 22903 USA
[6] Univ Virginia, Sch Med, Charlottesville, VA 22903 USA
[7] Univ Nebraska Med Ctr, Dept Biostat, Omaha, NE 68198 USA
[8] Univ North Carolina Chapel Hill, North Carolina Translat & Clin Sci Inst, Dept Med, Chapel Hill, NC 27599 USA
[9] Univ Virginia, Codirector integrated Translat Hlth Res Inst Virgi, Sch Data Sci, Charlottesville, VA 22903 USA
Keywords
dementia subtypes; electronic health records; machine learning algorithms; comorbidity patterns; multi-institutional studies; SMOTE
DOI
10.1093/jamiaopen/ooae076
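Chinese Library Classification number
R19 [Health care organizations and services (health services administration)]
Discipline classification code
Abstract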
Objectives: To provide a foundational methodology for differentiating comorbidity patterns in subphenotypes through investigation of a multi-site dementia patient dataset.
Materials and Methods: Employing the National Clinical Cohort Collaborative Tenant Pilot (N3C Clinical) dataset, our approach integrates machine learning algorithms (logistic regression and eXtreme Gradient Boosting [XGBoost]) with a diagnostic hierarchical model for nuanced classification of dementia subtypes based on comorbidities and gender. The methodology is enhanced by multi-site electronic health record (EHR) data and implements a hybrid sampling strategy that combines 65% Synthetic Minority Over-sampling Technique (SMOTE), 35% Random Under-Sampling (RUS), and Tomek Links to address class imbalance. The hierarchical model further refines the analysis, allowing for a layered understanding of disease patterns.
Results: The study identified significant comorbidity patterns associated with diagnosis of Alzheimer's, Vascular, and Lewy Body dementia subtypes. The classification models achieved accuracies up to 69% for Alzheimer's/Vascular dementia and highlighted challenges in distinguishing Dementia with Lewy Bodies. The hierarchical model elucidates the complexity of diagnosing Dementia with Lewy Bodies and reveals the potential impact of regional clinical practices on dementia classification.
Conclusion: Our methodology underscores the importance of leveraging multi-site datasets and tailored sampling techniques for dementia research. This framework holds promise for extending to other disease subtypes, offering a pathway to more nuanced and generalizable insights into dementia and its complex interplay with comorbid conditions.
Discussion: This study underscores the critical role of multi-site data analyses in understanding the relationship between comorbidities and disease subtypes. By utilizing diverse healthcare data, we emphasize the need to consider site-specific differences in clinical practices and patient demographics. Despite challenges like class imbalance and variability in EHR data, our findings highlight the essential contribution of multi-site data to developing accurate and generalizable models for disease classification.
This study aims to enhance our understanding and classification of dementia subtypes using data from multiple healthcare sites. Dementia includes forms like Alzheimer's, Vascular, and Lewy Body dementia, each with unique health conditions. Researchers analyzed data from 9 US sites using a multi-stage approach with machine learning techniques, specifically logistic regression and eXtreme Gradient Boosting (XGBoost).
The methodology involved 3 steps. First, the dataset was refined to focus on well-represented dementia subtypes. Next, advanced techniques balanced the data for fair representation. Finally, machine learning models classified the dementia types based on comorbidities and gender differences, achieving up to 70% accuracy for Alzheimer's and Vascular dementia, but finding Lewy Body dementia more challenging. A hierarchical model was used to address site-specific variations, revealing disparities among sites and improving generalization across populations.
This study highlights the complexity of diagnosing dementia subtypes and the limitations of single-site studies, which often suffer from biases. By leveraging data from multiple sites, the research underscores the importance of multi-site dataset analysis for better generalization. This approach enhances understanding of dementia and provides a framework applicable to other diseases.
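The hybrid sampling strategy named in the abstract (SMOTE, random under-sampling, and Tomek Links) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The abstract does not say what the 65% and 35% figures are measured against, so the sketch assumes one possible reading: 65% of the gap between the majority and minority class counts is closed by SMOTE and the remaining 35% by random under-sampling. It also assumes a binary problem with numeric feature tuples, and it drops both members of each Tomek link (dropping only the majority member is a common variant). All function names, parameters, and the labels "AD" and "DLB" are hypothetical.

```python
import random
from math import dist


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours.
    Assumes at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


def tomek_link_indices(X, y):
    """Return indices of points that form Tomek links: pairs from opposite
    classes that are each other's nearest neighbour."""
    def nearest(i):
        candidates = (j for j in range(len(X)) if j != i)
        return min(candidates, key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return {i for i, j in enumerate(nn) if nn[j] == i and y[i] != y[j]}


def hybrid_resample(X, y, minority_label, smote_share=0.65, k=5, seed=0):
    """Balance a binary dataset, then remove Tomek links.

    ASSUMPTION: smote_share is the fraction of the class-count gap closed by
    SMOTE; the rest of the gap is closed by random under-sampling."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    majority = [x for x, lab in zip(X, y) if lab != minority_label]
    majority_label = next(lab for lab in y if lab != minority_label)

    gap = len(majority) - len(minority)
    n_new = round(smote_share * gap)   # minority points added by SMOTE
    n_drop = gap - n_new               # majority points removed by RUS

    minority = minority + smote(minority, n_new, k, rng)
    majority = rng.sample(majority, len(majority) - n_drop)

    X_bal = minority + majority
    y_bal = [minority_label] * len(minority) + [majority_label] * len(majority)

    # Cleaning step: drop both members of every Tomek link.
    links = tomek_link_indices(X_bal, y_bal)
    keep = [i for i in range(len(X_bal)) if i not in links]
    return [X_bal[i] for i in keep], [y_bal[i] for i in keep]


# Toy demonstration on synthetic 2-D points (not N3C data).
demo_rng = random.Random(1)
X_demo = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(100)]
X_demo += [(demo_rng.gauss(2, 1), demo_rng.gauss(2, 1)) for _ in range(20)]
y_demo = ["AD"] * 100 + ["DLB"] * 20
X_res, y_res = hybrid_resample(X_demo, y_demo, minority_label="DLB")
print(len(X_res), y_res.count("AD"), y_res.count("DLB"))
```

Under the stated reading, the two classes have equal counts after the SMOTE and under-sampling steps, and Tomek-link removal then deletes borderline pairs one from each class, so the classes stay balanced. A production analysis would more likely rely on a maintained resampling library than on this brute-force nearest-neighbour search.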
Pages: 13