Reporting of demographic data and representativeness in machine learning models using electronic health records

Cited by: 29
Authors
Bozkurt, Selen [1]
Cahan, Eli M. [1, 2]
Seneviratne, Martin G. [1]
Sun, Ran [1]
Lossio-Ventura, Juan A. [1]
Ioannidis, John P. A. [1, 3, 4, 5, 6]
Hernandez-Boussard, Tina [1, 4, 7]
Affiliations
[1] Stanford Univ, Dept Med, Stanford, CA 94306 USA
[2] NYU, Sch Med, New York, NY USA
[3] Stanford Univ, Sch Med, Dept Epidemiol & Populat Hlth, Stanford, CA 94306 USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94306 USA
[5] Stanford Univ, Dept Stat, Stanford, CA 94306 USA
[6] Stanford Univ, Metares Innovat Ctr Stanford, Stanford, CA 94306 USA
[7] Stanford Univ, Dept Surg, Stanford, CA 94306 USA
Keywords
demographic data; machine learning; electronic health record; clinical decision support; bias; transparency; prediction; risk
DOI
10.1093/jamia/ocaa164
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Objective: The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility.
Materials and Methods: We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019.
Results: Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64% of studies; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report whether they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies included higher proportions of White and Black subjects, yet fewer Hispanic subjects, than the general US population.
Discussion: The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency are necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.
Pages: 1878-1884
Page count: 7
Related papers
50 in total
  • [1] Delirium Prediction using Machine Learning Models on Preoperative Electronic Health Records Data
    Davoudi, Anis
    Ebadi, Ashkan
    Rashidi, Parisa
    Ozrazgat-Baslanti, Tazcan
    Bihorac, Azra
    Bursian, Alberto C.
    2017 IEEE 17TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE), 2017 : 568 - 573
  • [2] Development and validation of models for detection of postoperative infections using structured electronic health records data and machine learning
    Colborn, Kathryn L.
    Zhuang, Yaxu
    Dyas, Adam R.
    Henderson, William G.
    Madsen, Helen J.
    Bronsert, Michael R.
    Matheny, Michael E.
    Lambert-Kerzner, Anne
    Myers, Quintin W. O.
    Meguid, Robert A.
    SURGERY, 2023, 173 (02) : 464 - 471
  • [3] Subphenotyping depression using machine learning and electronic health records
    Xu, Zhenxing
    Wang, Fei
    Adekkanattu, Prakash
    Bose, Budhaditya
    Vekaria, Veer
    Brandt, Pascal
    Jiang, Guoqian
    Kiefer, Richard C.
    Luo, Yuan
    Pacheco, Jennifer A.
    Rasmussen, Luke V.
    Xu, Jie
    Alexopoulos, George
    Pathak, Jyotishman
    LEARNING HEALTH SYSTEMS, 2020, 4 (04)
  • [4] Predicting neurodevelopmental disorders using machine learning models and electronic health records - status of the field
    Rajagopalan, Shyam Sundar
    Tammimies, Kristiina
    JOURNAL OF NEURODEVELOPMENTAL DISORDERS, 2024, 16 (01)
  • [5] Data Analytics and Machine Learning for Disease Identification in Electronic Health Records
    Benke, Kurt K.
    JAMA OPHTHALMOLOGY, 2019, 137 (05) : 497 - 498
  • [6] Machine learning in infection management using routine electronic health records: tools, techniques, and reporting of future technologies
    Luz, C. F.
    Vollmer, M.
    Decruyenaere, J.
    Nijsten, M. W.
    Glasner, C.
    Sinha, B.
    CLINICAL MICROBIOLOGY AND INFECTION, 2020, 26 (10) : 1291 - 1299
  • [7] Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data
    Getz, Kylie
    Hubbard, Rebecca A.
    Linn, Kristin A.
    EPIDEMIOLOGY, 2023, 34 (02) : 206 - 215
  • [8] Prediction and diagnosis of depression using machine learning with electronic health records data: a systematic review
    Nickson, David
    Meyer, Caroline
    Walasek, Lukasz
    Toro, Carla
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2023, 23 (01)
  • [9] A Framework for Systematic Assessment of Clinical Trial Population Representativeness Using Electronic Health Records Data
    Sun, Yingcheng
    Butler, Alex
    Diallo, Ibrahim
    Kim, Jae Hyun
    Ta, Casey
    Rogers, James R.
    Liu, Hao
    Weng, Chunhua
    APPLIED CLINICAL INFORMATICS, 2021, 12 (04) : 816 - 825