High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Cited by: 87
Authors
Zhang, Yichi [1]
Cai, Tianrun [2]
Yu, Sheng [3,4]
Cho, Kelly [5,6]
Hong, Chuan [1]
Sun, Jiehuan [1]
Huang, Jie [2]
Ho, Yuk-Lam [5]
Ananthakrishnan, Ashwin N. [7]
Xia, Zongqi [8]
Shaw, Stanley Y. [9]
Gainer, Vivian [10]
Castro, Victor [10]
Link, Nicholas [5]
Honerlaw, Jacqueline [5]
Huang, Sicong [2]
Gagnon, David [5,16]
Karlson, Elizabeth W. [2]
Plenge, Robert M. [2]
Szolovits, Peter [11]
Savova, Guergana [12]
Churchill, Susanne [13]
O'Donnell, Christopher [5,14]
Murphy, Shawn N. [10,13,15]
Gaziano, J. Michael [5,6]
Kohane, Isaac [13]
Cai, Tianxi [1,13]
Liao, Katherine P. [2,5,13]
Affiliations
[1] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
[2] Brigham & Womens Hosp, Div Rheumatol Immunol & Allergy, 75 Francis St, Boston, MA 02115 USA
[3] Tsinghua Univ, Ctr Stat Sci, Beijing, Peoples R China
[4] Tsinghua Univ, Dept Ind Engn, Beijing, Peoples R China
[5] VA Boston Healthcare Syst, Div Data Sci, Boston, MA 02130 USA
[6] Brigham & Womens Hosp, Div Aging, 75 Francis St, Boston, MA 02115 USA
[7] Massachusetts Gen Hosp, Dept Gastroenterol, Boston, MA 02114 USA
[8] Univ Pittsburgh, Dept Neurol, Pittsburgh, PA 15260 USA
[9] Brigham & Womens Hosp, Div Cardiovasc Med, 75 Francis St, Boston, MA 02115 USA
[10] Partners Healthcare, Res Informat Sci & Comp, Boston, MA USA
[11] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[12] Boston Childrens Hosp, Computat Hlth Informat Program, Boston, MA USA
[13] Harvard Med Sch, Dept Biomed Informat, Boston, MA 02115 USA
[14] VA Boston Healthcare Syst, Div Cardiol, Boston, MA USA
[15] Massachusetts Gen Hosp, Dept Neurol, Boston, MA 02114 USA
[16] Boston Univ, Dept Biostat, Boston, MA 02215 USA
Keywords
IDENTIFY RHEUMATOID-ARTHRITIS; PHENOME-WIDE ASSOCIATION; LARGE-SCALE; VITAMIN-D; HEALTH; RISK; IDENTIFICATION; INFORMATICS; EXTRACTION; ALGORITHM;
DOI
10.1038/s41596-019-0227-6
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
Pages: 3426-3444
Number of pages: 19