FIBER: enabling flexible retrieval of electronic health records data for clinical predictive modeling

Cited by: 1
Authors
Datta, Suparno [1, 2]
Sachs, Jan Philipp [1, 2]
Freitas Da Cruz, Harry [1, 2]
Martensen, Tom [1]
Bode, Philipp [1]
Sasso, Ariane Morassi [1, 2]
Glicksberg, Benjamin S. [2, 3]
Boettinger, Erwin [1, 2]
Affiliations
[1] Univ Potsdam, Digital Hlth Ctr, Hasso Plattner Inst, Rudolf Breitscheid Str 187, D-14482 Potsdam, Germany
[2] Icahn Sch Med Mt Sinai, Hasso Plattner Inst Digital Hlth Mt Sinai, New York, NY 10029 USA
[3] Icahn Sch Med Mt Sinai, Dept Genet & Genom Sci, New York, NY 10029 USA
Funding
National Institutes of Health (USA); European Union Horizon 2020
Keywords
databases, factual; electronic health records; information storage and retrieval; workflow; software/instrumentation; ENTERPRISE
DOI
10.1093/jamiaopen/ooab048
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services administration)]
Abstract
Objectives: The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames.
Materials and Methods: FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER's capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models.
Results: Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case.
Conclusion: FIBER is an open-source Python library developed for extracting information from star schema clinical data warehouses and reduces time-to-modeling, helping to streamline the clinical modeling process.
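Illustrative sketch (not part of the original record): the kind of retrieval the abstract describes, resolving concepts against a star schema, selecting a cohort by an index event, and labelling a later outcome, can be shown on a toy example. This is not FIBER's API and not the i2b2 schema. The table names (`fact`, `concept_dim`, `patient_dim`), the columns, the function `build_cohort`, and all data values are hypothetical and chosen only to keep the example self-contained; an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical toy star schema: one fact table and two dimension tables.
# Names and values are illustrative only; they are NOT taken from i2b2 or FIBER.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient_dim (patient_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE concept_dim (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
    CREATE TABLE fact (patient_id INTEGER, concept_id INTEGER, encounter_day INTEGER);

    INSERT INTO patient_dim VALUES (1, 64), (2, 71), (3, 58);
    INSERT INTO concept_dim VALUES
        (10, 'heart_surgery'), (20, 'acute_kidney_injury'), (30, 'hypertension');
    INSERT INTO fact VALUES
        (1, 10, 0), (1, 20, 2), (2, 10, 0), (3, 30, 0), (3, 20, 5);
""")


def build_cohort(conn, index_event, outcome):
    """Return (patient_id, age, index_day, label) rows for an index-event cohort.

    label is 1 if the outcome concept is recorded on or after the patient's
    first index event, else 0.
    """
    query = """
        WITH index_events AS (
            -- First occurrence of the index event per patient (fact -> concept join).
            SELECT f.patient_id, MIN(f.encounter_day) AS index_day
            FROM fact AS f
            JOIN concept_dim AS c ON c.concept_id = f.concept_id
            WHERE c.concept_name = :index_event
            GROUP BY f.patient_id
        )
        SELECT i.patient_id,
               p.age,
               i.index_day,
               EXISTS (
                   SELECT 1
                   FROM fact AS f
                   JOIN concept_dim AS c ON c.concept_id = f.concept_id
                   WHERE f.patient_id = i.patient_id
                     AND c.concept_name = :outcome
                     AND f.encounter_day >= i.index_day
               ) AS label
        FROM index_events AS i
        JOIN patient_dim AS p ON p.patient_id = i.patient_id
        ORDER BY i.patient_id
    """
    params = {"index_event": index_event, "outcome": outcome}
    return conn.execute(query, params).fetchall()


rows = build_cohort(conn, "heart_surgery", "acute_kidney_injury")
for row in rows:
    print(row)
```

Each returned row corresponds to one cohort member with a demographic feature and a binary outcome label; in practice such rows would typically be loaded into a data frame before modeling. Patient 3 is excluded here because no index event is recorded for them, even though the outcome concept is.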
Pages: 10