Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis

Cited by: 91
Authors
Beaulieu-Jones, Brett K. [1,2]
Lavage, Daniel R. [3]
Snyder, John W. [3]
Moore, Jason H. [2]
Pendergrass, Sarah A. [3]
Bauer, Christopher R. [3]
Affiliations
[1] Univ Penn, Perelman Sch Med, Genom & Comp Biol Grad Grp, Philadelphia, PA 19104 USA
[2] Univ Penn, Inst Biomed Informat, Philadelphia, PA 19104 USA
[3] Geisinger, Biomed & Translat Informat Inst, 100 N Acad Ave, Danville, PA 17822 USA
Funding
National Institutes of Health (USA)
Keywords
imputation; missing data; clinical laboratory test results; electronic health records; multiple imputation; sensitivity analysis
DOI
10.2196/medinform.8960
Chinese Library Classification (CLC) number
R-058
Abstract
Background: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.
Objective: The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.
Methods: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).
Results: Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.
Conclusions: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
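The evaluation idea in the Methods (start from complete cases, hide values under a chosen mechanism of missingness, impute, then score the error on the hidden entries) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the synthetic data, the 20% missing fraction, and the rank-based MNAR mechanism are all assumptions made for illustration, and column-mean imputation is only a simple stand-in for an imputation method.

```python
import numpy as np

def mask_mcar(X, frac, rng):
    """Hypothetical helper: hide entries completely at random (MCAR)."""
    return rng.random(X.shape) < frac

def mask_mnar_high(X, frac, rng):
    """Hypothetical helper: hide entries with a probability that grows with
    the value's within-column rank, so missingness depends on the hidden
    value itself (an MNAR-style mechanism)."""
    ranks = X.argsort(axis=0).argsort(axis=0) / (X.shape[0] - 1)  # 0..1
    prob = np.clip(2 * frac * ranks, 0.0, 1.0)  # averages to about frac
    return rng.random(X.shape) < prob

def impute_column_mean(X_missing):
    """Placeholder imputer: replace each NaN with its column's observed mean."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)

def masked_rmse(X_true, X_imputed, mask):
    """Root-mean-square error computed only on the entries that were hidden."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def evaluate(X, mask_fn, frac=0.2, seed=0):
    """Hide values with mask_fn, impute them, and score against the truth."""
    rng = np.random.default_rng(seed)
    mask = mask_fn(X, frac, rng)
    X_missing = np.where(mask, np.nan, X)
    return masked_rmse(X, impute_column_mean(X_missing), mask)

# Synthetic "complete cases" (assumed data, not real laboratory values).
data_rng = np.random.default_rng(42)
X_complete = data_rng.normal(loc=100.0, scale=15.0, size=(1000, 5))

rmse_mcar = evaluate(X_complete, mask_mcar)
rmse_mnar = evaluate(X_complete, mask_mnar_high)
print(f"MCAR RMSE: {rmse_mcar:.2f}")
print(f"MNAR RMSE: {rmse_mnar:.2f}")
```

Swapping `impute_column_mean` for another imputer while keeping the same masks would give a like-for-like comparison of methods under each mechanism, which is the general shape of the comparison the abstract describes.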
Pages: 12