Reporting of demographic data and representativeness in machine learning models using electronic health records

Cited by: 29

Authors:

Bozkurt, Selen ^{[1]}

Cahan, Eli M. ^{[1,2]}

Seneviratne, Martin G. ^{[1]}

Sun, Ran ^{[1]}

Lossio-Ventura, Juan A. ^{[1]}

Ioannidis, John P. A. ^{[1,3,4,5,6]}

Hernandez-Boussard, Tina ^{[1,4,7]}

Affiliations:

[1] Stanford Univ, Dept Med, Stanford, CA 94306 USA

[2] NYU, Sch Med, New York, NY USA

[3] Stanford Univ, Sch Med, Dept Epidemiol & Populat Hlth, Stanford, CA 94306 USA

[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94306 USA

[5] Stanford Univ, Dept Stat, Stanford, CA 94306 USA

[6] Stanford Univ, Metares Innovat Ctr Stanford, Stanford, CA 94306 USA

[7] Stanford Univ, Dept Surg, Stanford, CA 94306 USA

Source:

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION | 2020 / Vol. 27 / Issue 12

Keywords:

demographic data; machine learning; electronic health record; clinical decision support; bias; transparency; PREDICTION; RISK; BIAS

DOI:

10.1093/jamia/ocaa164

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Objective: The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility.

Materials and Methods: We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019.

Results: Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64%; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report if they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies include higher proportions of White and Black yet fewer Hispanic subjects compared to the general US population.

Discussion: The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency is necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.


Pages: 1878-1884

Page count: 7

