High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Cited by: 95
Authors
Zhang, Yichi [1]
Cai, Tianrun [2]
Yu, Sheng [3, 4]
Cho, Kelly [5, 6]
Hong, Chuan [1]
Sun, Jiehuan [1]
Huang, Jie [2]
Ho, Yuk-Lam [5]
Ananthakrishnan, Ashwin N. [7]
Xia, Zongqi [8]
Shaw, Stanley Y. [9]
Gainer, Vivian [10]
Castro, Victor [10]
Link, Nicholas [5]
Honerlaw, Jacqueline [5]
Huang, Sicong [2]
Gagnon, David [5, 16]
Karlson, Elizabeth W. [2]
Plenge, Robert M. [2]
Szolovits, Peter [11]
Savova, Guergana [12]
Churchill, Susanne [13]
O'Donnell, Christopher [5, 14]
Murphy, Shawn N. [10, 13, 15]
Gaziano, J. Michael [5, 6]
Kohane, Isaac [13]
Cai, Tianxi [1, 13]
Liao, Katherine P. [2, 5, 13]
Affiliations
[1] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
[2] Brigham & Womens Hosp, Div Rheumatol Immunol & Allergy, 75 Francis St, Boston, MA 02115 USA
[3] Tsinghua Univ, Ctr Stat Sci, Beijing, Peoples R China
[4] Tsinghua Univ, Dept Ind Engn, Beijing, Peoples R China
[5] VA Boston Healthcare Syst, Div Data Sci, Boston, MA 02130 USA
[6] Brigham & Womens Hosp, Div Aging, 75 Francis St, Boston, MA 02115 USA
[7] Massachusetts Gen Hosp, Dept Gastroenterol, Boston, MA 02114 USA
[8] Univ Pittsburgh, Dept Neurol, Pittsburgh, PA 15260 USA
[9] Brigham & Womens Hosp, Div Cardiovasc Med, 75 Francis St, Boston, MA 02115 USA
[10] Partners Healthcare, Res Informat Sci & Comp, Boston, MA USA
[11] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[12] Boston Childrens Hosp, Computat Hlth Informat Program, Boston, MA USA
[13] Harvard Med Sch, Dept Biomed Informat, Boston, MA 02115 USA
[14] VA Boston Healthcare Syst, Div Cardiol, Boston, MA USA
[15] Massachusetts Gen Hosp, Dept Neurol, Boston, MA 02114 USA
[16] Boston Univ, Dept Biostat, Boston, MA 02215 USA
Keywords
IDENTIFY RHEUMATOID-ARTHRITIS; PHENOME-WIDE ASSOCIATION; LARGE-SCALE; VITAMIN-D; HEALTH; RISK; IDENTIFICATION; INFORMATICS; EXTRACTION; ALGORITHM
DOI
10.1038/s41596-019-0227-6
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
Pages: 3426-3444
Page count: 19
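
The abstract describes a flow from EMR features (structured codes plus NLP-derived mentions), through algorithm training on chart-reviewed labels, to a per-patient probability and a yes/no classification. The following is a minimal sketch of that flow only, not the PheCAP implementation or its API. It assumes log-transformed counts as features and a plain L2-penalized logistic regression as a stand-in model; the automated feature-selection step is omitted, and every function name, the toy counts, and the 0.5 cutoff are hypothetical choices made for illustration.

```python
import math

def log1p_features(row):
    """Log-transform raw counts (assumed: code counts and NLP mention counts)."""
    return [math.log1p(x) for x in row]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, l2=0.01, lr=0.5, epochs=2000):
    """Fit L2-penalized logistic regression by batch gradient descent.

    Stand-in for the 'algorithm training' step; the penalty, learning rate,
    and epoch count are illustrative assumptions.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
    return w, b

def predict_proba(X, w, b):
    """Phenotype probability for each patient."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Toy data: [diagnosis-code count, NLP mention count] per patient.
labeled_counts = [[0, 0], [1, 0], [0, 1], [8, 5], [12, 9], [5, 7]]
labels = [0, 0, 0, 1, 1, 1]          # hypothetical chart-review labels
unlabeled_counts = [[10, 6], [0, 0], [2, 1]]
all_counts = labeled_counts + unlabeled_counts

# Train on the small labeled subset, then score every patient.
w, b = fit_logistic([log1p_features(r) for r in labeled_counts], labels)
probs = predict_proba([log1p_features(r) for r in all_counts], w, b)

# Classify with an assumed cutoff; in practice the cutoff would be tuned.
CUTOFF = 0.5
calls = ["yes" if p >= CUTOFF else "no" for p in probs]

for counts, p, call in zip(all_counts, probs, calls):
    print(counts, round(p, 3), call)
```

The three outputs named in the abstract map onto `w, b` (the algorithm), `probs` (the probability for all patients), and `calls` (the classification). A real analysis would choose the cutoff against a target specificity or positive predictive value on held-out labels rather than fixing it at 0.5.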