High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Cited by: 87
Authors
Zhang, Yichi [1]
Cai, Tianrun [2]
Yu, Sheng [3, 4]
Cho, Kelly [5, 6]
Hong, Chuan [1]
Sun, Jiehuan [1]
Huang, Jie [2]
Ho, Yuk-Lam [5]
Ananthakrishnan, Ashwin N. [7]
Xia, Zongqi [8]
Shaw, Stanley Y. [9]
Gainer, Vivian [10]
Castro, Victor [10]
Link, Nicholas [5]
Honerlaw, Jacqueline [5]
Huang, Sicong [2]
Gagnon, David [5, 16]
Karlson, Elizabeth W. [2]
Plenge, Robert M. [2]
Szolovits, Peter [11]
Savova, Guergana [12]
Churchill, Susanne [13]
O'Donnell, Christopher [5, 14]
Murphy, Shawn N. [10, 13, 15]
Gaziano, J. Michael [5, 6]
Kohane, Isaac [13]
Cai, Tianxi [1, 13]
Liao, Katherine P. [2, 5, 13]
Affiliations
[1] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
[2] Brigham & Womens Hosp, Div Rheumatol Immunol & Allergy, 75 Francis St, Boston, MA 02115 USA
[3] Tsinghua Univ, Ctr Stat Sci, Beijing, Peoples R China
[4] Tsinghua Univ, Dept Ind Engn, Beijing, Peoples R China
[5] VA Boston Healthcare Syst, Div Data Sci, Boston, MA 02130 USA
[6] Brigham & Womens Hosp, Div Aging, 75 Francis St, Boston, MA 02115 USA
[7] Massachusetts Gen Hosp, Dept Gastroenterol, Boston, MA 02114 USA
[8] Univ Pittsburgh, Dept Neurol, Pittsburgh, PA 15260 USA
[9] Brigham & Womens Hosp, Div Cardiovasc Med, 75 Francis St, Boston, MA 02115 USA
[10] Partners Healthcare, Res Informat Sci & Comp, Boston, MA USA
[11] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[12] Boston Childrens Hosp, Computat Hlth Informat Program, Boston, MA USA
[13] Harvard Med Sch, Dept Biomed Informat, Boston, MA 02115 USA
[14] VA Boston Healthcare Syst, Div Cardiol, Boston, MA USA
[15] Massachusetts Gen Hosp, Dept Neurol, Boston, MA 02114 USA
[16] Boston Univ, Dept Biostat, Boston, MA 02215 USA
Keywords
IDENTIFY RHEUMATOID-ARTHRITIS; PHENOME-WIDE ASSOCIATION; LARGE-SCALE; VITAMIN-D; HEALTH; RISK; IDENTIFICATION; INFORMATICS; EXTRACTION; ALGORITHM;
DOI
10.1038/s41596-019-0227-6
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
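The final products named in the abstract (a phenotype algorithm, a per-patient probability, and a yes/no classification) can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain L2-penalized logistic regression fitted by gradient descent, and the patient IDs, feature counts, chart-review labels, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(features, labels, l2=0.01, lr=0.1, epochs=2000):
    """Fit an L2-penalized logistic regression by batch gradient descent."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def to_features(counts):
    # Log-transform raw counts so heavy users of care do not dominate.
    return [math.log(1 + c) for c in counts]

# Hypothetical data: (diagnosis-code count, NLP mention count) per patient.
all_patients = {
    "p1": (12, 9), "p2": (8, 5), "p3": (15, 11),
    "p4": (1, 0), "p5": (0, 1), "p6": (2, 0),
    "p7": (10, 7), "p8": (0, 0),
}
# Hypothetical gold-standard labels from chart review (a labeled subset only).
chart_review = {"p1": 1, "p2": 1, "p3": 1, "p4": 0, "p5": 0, "p6": 0}

# Train the phenotype algorithm on the labeled subset.
train_x = [to_features(all_patients[pid]) for pid in chart_review]
train_y = [chart_review[pid] for pid in chart_review]
weights, bias = fit_logistic(train_x, train_y)

# Score every patient, labeled or not, then apply an assumed threshold.
THRESHOLD = 0.5
probabilities = {
    pid: sigmoid(bias + sum(w * xi for w, xi in zip(weights, to_features(c))))
    for pid, c in all_patients.items()
}
classifications = {
    pid: "yes" if p >= THRESHOLD else "no" for pid, p in probabilities.items()
}

for pid in sorted(all_patients):
    print(pid, round(probabilities[pid], 3), classifications[pid])
```

In practice the threshold would be chosen against a target specificity or positive predictive value on the labeled set rather than fixed at 0.5.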
Pages: 3426-3444
Number of pages: 19
Related papers
47 in total
  • [1] Learning statistical models of phenotypes using noisy labeled training data
    Agarwal, Vibhu
    Podchiyska, Tanya
    Banda, Juan M.
    Goel, Veena
    Leung, Tiffany I.
    Minty, Evan P.
    Sweeney, Timothy E.
    Gyang, Elsie
    Shah, Nigam H.
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2016, 23 (06) : 1166 - 1173
  • [2] Common Genetic Variants Influence Circulating Vitamin D Levels in Inflammatory Bowel Diseases
    Ananthakrishnan, Ashwin N.
    Cagan, Andrew
    Cai, Tianxi
    Gainer, Vivian S.
    Shaw, Stanley Y.
    Churchill, Susanne
    Karlson, Elizabeth W.
    Murphy, Shawn N.
    Kohane, Isaac
    Liao, Katherine P.
    Xavier, Ramnik J.
    [J]. INFLAMMATORY BOWEL DISEASES, 2015, 21 (11) : 2507 - 2514
  • [3] Association Between Reduced Plasma 25-Hydroxy Vitamin D and Increased Risk of Cancer in Patients With Inflammatory Bowel Diseases
    Ananthakrishnan, Ashwin N.
    Cheng, Su-Chun
    Cai, Tianxi
    Cagan, Andrew
    Gainer, Vivian S.
    Szolovits, Peter
    Shaw, Stanley Y.
    Churchill, Susanne
    Karlson, Elizabeth W.
    Murphy, Shawn N.
    Kohane, Isaac
    Liao, Katherine P.
    [J]. CLINICAL GASTROENTEROLOGY AND HEPATOLOGY, 2014, 12 (05) : 821 - 827
  • [4] Improving Case Definition of Crohn's Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach
    Ananthakrishnan, Ashwin N.
    Cai, Tianxi
    Savova, Guergana
    Cheng, Su-Chun
    Chen, Pei
    Perez, Raul Guzman
    Gainer, Vivian S.
    Murphy, Shawn N.
    Szolovits, Peter
    Xia, Zongqi
    Shaw, Stanley
    Churchill, Susanne
    Karlson, Elizabeth W.
    Kohane, Isaac
    Plenge, Robert M.
    Liao, Katherine P.
    [J]. INFLAMMATORY BOWEL DISEASES, 2013, 19 (07) : 1411 - 1420
  • [5] Aronson AR, 2001, J AM MED INFORM ASSN, P17
  • [6] Banda Juan M, 2017, AMIA Jt Summits Transl Sci Proc, V2017, P48
  • [7] Informatics and machine learning to define the phenotype
    Basile, Anna Okula
    Ritchie, Marylyn DeRiggi
    [J]. EXPERT REVIEW OF MOLECULAR DIAGNOSTICS, 2018, 18 (03) : 219 - 226
  • [8] Rapid Identification of Myocardial Infarction Risk Associated With Diabetes Medications Using Electronic Medical Records
    Brownstein, John S.
    Murphy, Shawn N.
    Goldfine, Allison B.
    Grant, Richard W.
    Sordo, Margarita
    Gainer, Vivian
    Colecchi, Judith A.
    Dubey, Anil
    Nathan, David M.
    Glaser, John P.
    Kohane, Isaac S.
    [J]. DIABETES CARE, 2010, 33 (03) : 526 - 531
  • [9] The Association Between Arthralgia and Vedolizumab Using Natural Language Processing
    Cai, Tianrun
    Lin, Tzu-Chieh
    Bond, Allison
    Huang, Jie
    Kane-Wanger, Gwendolyn
    Cagan, Andrew
    Murphy, Shawn N.
    Ananthakrishnan, Ashwin N.
    Liao, Katherine P.
    [J]. INFLAMMATORY BOWEL DISEASES, 2018, 24 (10) : 2242 - 2246
  • [10] An atlas of genetic associations in UK Biobank
    Canela-Xandri, Oriol
    Rawlik, Konrad
    Tenesa, Albert
    [J]. NATURE GENETICS, 2018, 50 (11) : 1593 - +