Reporting of demographic data and representativeness in machine learning models using electronic health records

Cited by: 29
Authors
Bozkurt, Selen [1]
Cahan, Eli M. [1,2]
Seneviratne, Martin G. [1]
Sun, Ran [1]
Lossio-Ventura, Juan A. [1]
Ioannidis, John P. A. [1,3,4,5,6]
Hernandez-Boussard, Tina [1,4,7]
Affiliations
[1] Stanford Univ, Dept Med, Stanford, CA 94306 USA
[2] NYU, Sch Med, New York, NY USA
[3] Stanford Univ, Sch Med, Dept Epidemiol & Populat Hlth, Stanford, CA 94306 USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94306 USA
[5] Stanford Univ, Dept Stat, Stanford, CA 94306 USA
[6] Stanford Univ, Metares Innovat Ctr Stanford, Stanford, CA 94306 USA
[7] Stanford Univ, Dept Surg, Stanford, CA 94306 USA
Keywords
demographic data; machine learning; electronic health record; clinical decision support; bias; transparency; PREDICTION; RISK; BIAS;
DOI
10.1093/jamia/ocaa164
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility.
Materials and Methods: We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019.
Results: Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64% of studies; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report whether they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies included higher proportions of White and Black subjects, yet fewer Hispanic subjects, than the general US population.
Discussion: The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency are necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.
Pages: 1878-1884
Number of pages: 7
Related papers
50 in total
  • [31] Using Electronic Health Records and Machine Learning to Predict Incident Psychiatric Hospitalization
    DeFerio, Joseph
    Banerjee, Samprit
    Alexopoulos, George
    Pathak, Jyotishman
    BIOLOGICAL PSYCHIATRY, 2020, 87 (09) : S68 - S69
  • [32] Predicting the Risk of Inpatient Hypoglycemia With Machine Learning Using Electronic Health Records
    Ruan, Yue
    Bellot, Alexis
    Moysova, Zuzana
    Tan, Garry D.
    Lumb, Alistair
    Davies, Jim
    van der Schaar, Mihaela
    Rea, Rustam
    DIABETES CARE, 2020, 43 (07) : 1504 - 1511
  • [33] Machine learning based prediction models for cardiovascular disease risk using electronic health records data: systematic review and meta-analysis
    Liu, Tianyi
    Krentz, Andrew
    Lu, Lei
    Curcin, Vasa
    EUROPEAN HEART JOURNAL - DIGITAL HEALTH, 2024, 6 (01) : 7 - 22
  • [34] A machine-learning prediction model to identify risk of firearm injury using electronic health records data
    Zhou, Hui
    Nau, Claudia
    Xie, Fagen
    Contreras, Richard
    Grant, Deborah Ling
    Negriff, Sonya
    Sidell, Margo
    Koebnick, Corinna
    Hechter, Rulin
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (10) : 2173 - 2180
  • [36] The application of unsupervised deep learning in predictive models using electronic health records
    Wang, Lei
    Tong, Liping
    Davis, Darcy
    Arnold, Tim
    Esposito, Tina
    BMC MEDICAL RESEARCH METHODOLOGY, 2020, 20 (01)
  • [37] Development and Evaluation of Machine Learning Models for the Identification of Surgical Site Infection in Electronic Health Records
    Chakraborty, Arjun
    Lybarger, Kevin
    Estebane, Jorge A. Olivas
    Chen, Judy Y.
    Patel, Mahul
    O'Reilly-Shah, Vikas
    Tarczy-Hornoch, Peter
    Yetisgen, Meliha
    Long, Dustin R.
    SURGICAL INFECTIONS, 2025,
  • [38] Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage
    Tang, Jianxiang
    Wang, Xiaoyu
    Wan, Hongli
    Lin, Chunying
    Shao, Zilun
    Chang, Yang
    Wang, Hexuan
    Wu, Yi
    Zhang, Tao
    Du, Yu
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2022, 22 (01)
  • [40] Estimation of postpartum depression risk from electronic health records using machine learning
    Amit, Guy
    Girshovitz, Irena
    Marcus, Karni
    Zhang, Yiye
    Pathak, Jyotishman
    Bar, Vered
    Akiva, Pinchas
    BMC PREGNANCY AND CHILDBIRTH, 2021, 21 (01)