EHRtemporalVariability: delineating temporal data-set shifts in electronic health records

Cited by: 19
Authors
Saez, Carlos [1, 2]
Gutierrez-Sacristan, Alba [2]
Kohane, Isaac [2]
Garcia-Gomez, Juan M. [1]
Avillach, Paul [2, 3]
Affiliations
[1] Univ Politecn Valencia, Inst Univ Tecnol Informac & Comunicac, Biomed Data Sci Lab, Camino Vera S-N, Valencia 46022, Spain
[2] Harvard Med Sch, Dept Biomed Informat, Boston, MA 02115 USA
[3] Boston Childrens Hosp, Computat Hlth Informat Program, Boston, MA USA
Funding
European Union Horizon 2020
Keywords
data-set shifts; data quality; temporal variability; scientific data sets; electronic health records; claims data; research repositories; information geometry; visual analytics; R package
DOI
10.1093/gigascience/giaa079
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Background: Temporal variability in health-care processes or protocols is intrinsic to medicine. Such variability can introduce data-set shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal data-set shifts can present as trends, as well as abrupt or seasonal changes in the statistical distributions of data over time. The latter are particularly complicated to address in multimodal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large sets of historical data from EHRs, there is a need for specific software methods to help delineate temporal data-set shifts to ensure reliable data reuse.
Results: EHRtemporalVariability is an open-source R package and Shiny app designed to explore and identify temporal data-set shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time; projects their temporal evolution through non-parametric information geometric temporal plots; and enables the exploration of changes in variables through data temporal heat maps. We demonstrate the capability of EHRtemporalVariability to delineate data-set shifts in three impact case studies, one of which is available for reproducibility.
Conclusions: EHRtemporalVariability enables the exploration and identification of data-set shifts, contributing to the broad examination and repurposing of large, longitudinal data sets. Our goal is to help ensure reliable data reuse for a wide range of biomedical data users. EHRtemporalVariability is designed both for technical users who work with the R package programmatically and for users unfamiliar with programming, who can use the Shiny user interface.
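The three steps named in the Results (estimating distributions per time batch, projecting their temporal evolution, and inspecting changes as a heat map) can be illustrated with a minimal Python sketch. This is not the package's R API. It assumes monthly batches, the Jensen-Shannon distance as the dissimilarity between batches, and classical multidimensional scaling as the projection; none of these choices is stated in the abstract. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def temporal_distributions(dates, codes):
    """Relative frequency of each code per month.

    dates: ISO strings "YYYY-MM-DD"; codes: categorical values.
    Returns (months, vocab, probs) where probs has shape (months, codes).
    The rows of probs are what a temporal heat map would display.
    """
    months = sorted({d[:7] for d in dates})
    vocab = sorted(set(codes))
    m_idx = {m: i for i, m in enumerate(months)}
    c_idx = {c: j for j, c in enumerate(vocab)}
    counts = np.zeros((len(months), len(vocab)))
    for d, c in zip(dates, codes):
        counts[m_idx[d[:7]], c_idx[c]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return months, vocab, probs


def js_distance(p, q):
    """Jensen-Shannon distance (base 2), bounded in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(jsd, 0.0))


def pairwise_js(probs):
    """Symmetric matrix of distances between all pairs of time batches."""
    n = probs.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_distance(probs[i], probs[j])
    return dist


def classical_mds(dist, k=2):
    """Embed a distance matrix in k dimensions (classical MDS)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:k]
    vals = np.clip(vals[order], 0.0, None)  # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals)


# Synthetic example (assumed data): code "A" is replaced by "C" from April on,
# mimicking an abrupt shift caused by a change in coding practice.
dates, codes = [], []
for month in range(1, 7):
    main_code = "A" if month <= 3 else "C"
    dates += [f"2020-{month:02d}-15"] * 10
    codes += [main_code] * 8 + ["B"] * 2

months, vocab, probs = temporal_distributions(dates, codes)
dist = pairwise_js(probs)
coords = classical_mds(dist, k=2)

for month, (x, y) in zip(months, coords):
    print(month, round(float(x), 3), round(float(y), 3))
```

In the projection, months with similar distributions land close together, so the abrupt change appears as two separated groups of points (January to March versus April to June), while a gradual trend would instead trace a continuous path.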
Pages: 7