Reporting of demographic data and representativeness in machine learning models using electronic health records

Cited by: 29
Authors
Bozkurt, Selen [1 ]
Cahan, Eli M. [1 ,2 ]
Seneviratne, Martin G. [1 ]
Sun, Ran [1 ]
Lossio-Ventura, Juan A. [1 ]
Ioannidis, John P. A. [1 ,3 ,4 ,5 ,6 ]
Hernandez-Boussard, Tina [1 ,4 ,7 ]
Affiliations
[1] Stanford Univ, Dept Med, Stanford, CA 94306 USA
[2] NYU, Sch Med, New York, NY USA
[3] Stanford Univ, Sch Med, Dept Epidemiol & Populat Hlth, Stanford, CA 94306 USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94306 USA
[5] Stanford Univ, Dept Stat, Stanford, CA 94306 USA
[6] Stanford Univ, Metares Innovat Ctr Stanford, Stanford, CA 94306 USA
[7] Stanford Univ, Dept Surg, Stanford, CA 94306 USA
Keywords
demographic data; machine learning; electronic health record; clinical decision support; bias; transparency; prediction; risk
DOI
10.1093/jamia/ocaa164
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Objective: The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility.
Materials and Methods: We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019.
Results: Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64%; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report if they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies include higher proportions of White and Black yet fewer Hispanic subjects compared to the general US population.
Discussion: The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency is necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.
Pages: 1878-1884
Page count: 7
Related papers
50 in total
  • [41] Predicting Causes of Death from Structured Electronic Health Records Using Machine Learning
    Al-Garad, Mohammed
    Reeves, Ruth M.
    Desai, Rishi J.
    LeNoue-Newton, Michele
    Park, Daniel
    Wang, Shirley V.
    Maro, Judith C.
    Fuller, Candace C.
    Lin, Kueiyu Joshua
    Hernandez-Munoz, Jose J.
    Kuzucan, Aida
    Wang, Xi
    Pillai, Haritha
    Ngan, Kerry
    Whitaker, Jill
    Deere, Jessica
    McLemore, Michael F.
    Westerman, Dax
    Matheny, Michael E.
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2024, 33 : 71 - 72
  • [42] Early Prediction of Gestational Diabetes Mellitus Using Electronic Health Records and Machine Learning
    Germaine, Mark A.
    O'Higgins, Amy C.
    Healy, Graham
    Egan, Brendan
    DIABETES, 2024, 73
  • [43] Predicting hypoglycemia in critically Ill patients using machine learning and electronic health records
    Mantena, Sreekar
    Arevalo, Aldo Robles
    Maley, Jason H.
    Vieira, Susana M. da Silva
    Mateo-Collado, Roselyn
    Sousa, Joao M. da Costa
    Celi, Leo Anthony
    JOURNAL OF CLINICAL MONITORING AND COMPUTING, 2022, 36 (05) : 1297 - 1303
  • [44] Estimation of postpartum depression risk from electronic health records using machine learning
    Guy Amit
    Irena Girshovitz
    Karni Marcus
    Yiye Zhang
    Jyotishman Pathak
    Vered Bar
    Pinchas Akiva
    BMC Pregnancy and Childbirth, 21
  • [45] Predicting Acute Kidney Injury: A Machine Learning Approach Using Electronic Health Records
    Abdullah, Sheikh S.
    Rostamzadeh, Neda
    Sedig, Kamran
    Garg, Amit X.
    McArthur, Eric
    INFORMATION, 2020, 11 (08)
  • [46] Panacea: A Novel Architecture for Electronic Health Records System using Blockchain and Machine Learning
    Shah, Deep Rahul
    Dhawan, Dev Ajay
    Shah, Samit Nikesh
    Shah, Pannag Rajesh
    Francis, Sofia
    2022 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN ELECTRICAL, COMPUTING, COMMUNICATION AND SUSTAINABLE TECHNOLOGIES (ICAECT), 2022,
  • [47] Predicting hypoglycemia in critically Ill patients using machine learning and electronic health records
    Sreekar Mantena
    Aldo Robles Arévalo
    Jason H. Maley
    Susana M. da Silva Vieira
    Roselyn Mateo-Collado
    João M. da Costa Sousa
    Leo Anthony Celi
    Journal of Clinical Monitoring and Computing, 2022, 36 : 1297 - 1303
  • [48] Early prediction of clinical deterioration using data-driven machine-learning modeling of electronic health records
    Ruiz, Victor M.
    Goldsmith, Michael P.
    Shi, Lingyun
    Simpao, Allan F.
    Galvez, Jorge A.
    Naim, Maryam Y.
    Nadkarni, Vinay
    Gaynor, J. William
    Tsui, Fuchiang
    JOURNAL OF THORACIC AND CARDIOVASCULAR SURGERY, 2022, 164 (01) : 211 - +
  • [49] Data Mining in Spine Surgery: Leveraging Electronic Health Records for Machine Learning and Clinical Research
    Staartjes, Victor E.
    Stienen, Martin N.
    NEUROSPINE, 2019, 16 (04) : 654 - 656
  • [50] Machine Learning Analysis for Data Incompleteness (MADI): Analyzing the Data Completeness of Patient Records Using a Random Variable Approach to Predict the Incompleteness of Electronic Health Records
    Gurupur, Varadraj P.
    Shelleh, Muhammed
    IEEE ACCESS, 2021, 9 : 95994 - 96001