Machine learning approaches for electronic health records phenotyping: a methodical review

Cited by: 42
Authors
Yang, Siyue [1]
Varghese, Paul [2]
Stephenson, Ellen [3]
Tu, Karen [3]
Gronsbell, Jessica [1, 3, 4, 5]
Affiliations
[1] Univ Toronto, Dept Stat Sci, Toronto, ON, Canada
[2] Verily Life Sci, Cambridge, MA USA
[3] Univ Toronto, Dept Family & Community Med, Toronto, ON, Canada
[4] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[5] Univ Toronto, Dept Stat Sci, 700 Univ Ave, Toronto, ON M5G 1Z5, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
electronic health records; phenotyping; cohort identification; machine learning; CLINICAL-TRIALS; INFORMATION; VALIDATION; ALGORITHMS; EXTRACTION; SELECTION; MODEL; TEXT;
DOI
10.1093/jamia/ocac216
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used.
Materials and Methods: We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies.
Results: Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions.
Discussion: Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released.
Conclusion: Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Pages: 367-381
Number of pages: 15