Surrogate-assisted feature extraction for high-throughput phenotyping

Cited by: 52
Authors
Yu, Sheng [1, 2]
Chakrabortty, Abhishek [3]
Liao, Katherine P. [4]
Cai, Tianrun [5]
Ananthakrishnan, Ashwin N. [6]
Gainer, Vivian S. [7]
Churchill, Susanne E. [8]
Szolovits, Peter [9]
Murphy, Shawn N. [7, 10]
Kohane, Isaac S. [8]
Cai, Tianxi [3]
Affiliations
[1] Tsinghua Univ, Ctr Stat Sci, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Ind Engn, Beijing, Peoples R China
[3] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
[4] Brigham & Womens Hosp, Div Rheumatol, 75 Francis St, Boston, MA 02115 USA
[5] Brigham & Womens Hosp, Dept Radiol, 75 Francis St, Boston, MA 02115 USA
[6] Massachusetts Gen Hosp, Div Gastroenterol, Boston, MA 02114 USA
[7] Partners HealthCare, Res IS & Comp, Charlestown, MA USA
[8] Harvard Med Sch, Dept Biomed Informat, Boston, MA USA
[9] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[10] Massachusetts Gen Hosp, Dept Neurol, Boston, MA 02114 USA
Funding
US National Institutes of Health (NIH);
Keywords
electronic medical records; phenotyping; data mining; machine learning; ELECTRONIC MEDICAL-RECORDS; HEALTH RECORDS; RHEUMATOID-ARTHRITIS; EMERGE NETWORK; ICD-9-CM CODES; RISK; DISEASE; MORTALITY; CLASSIFICATION; IDENTIFICATION;
DOI
10.1093/jamia/ocw135
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Objective: Phenotyping algorithms can accurately identify patients with specific phenotypes within electronic medical record systems. However, developing phenotyping algorithms in a scalable way remains a challenge because of the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method that improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy.
Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision (ICD-9) and natural language processing (NLP) counts, acting as noisy surrogates for the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features.
Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels, to compare the performance of features selected by SAFE, by a previously published automated feature extraction for phenotyping (AFEP) procedure, and by domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes.
Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the number of gold-standard labels needed. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or by experts.
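The surrogate idea in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the count thresholds, the feature names, and the use of plain Pearson correlation screening (a simple stand-in for whatever predictive model the paper fits) are all assumptions made for illustration. Patients with extreme ICD-9 and NLP counts receive silver-standard labels, the rest stay unlabeled, and candidate features are kept only if they track the silver labels.

```python
def silver_labels(icd_counts, nlp_counts, low=0, high=3):
    """Assign silver-standard labels from two noisy surrogate counts.

    The `low`/`high` cutoffs are hypothetical, not values from the paper.
    Patients with both counts high get 1, both counts low get 0,
    and everyone in between is left unlabeled (None).
    """
    labels = []
    for icd, nlp in zip(icd_counts, nlp_counts):
        if icd >= high and nlp >= high:
            labels.append(1)
        elif icd <= low and nlp <= low:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_features(features, labels, threshold=0.3):
    """Keep features whose |correlation| with the silver labels meets `threshold`.

    Correlation screening is an assumed stand-in for a fitted predictive model.
    Only patients that received a silver label are used.
    """
    keep = [i for i, y in enumerate(labels) if y is not None]
    y = [labels[i] for i in keep]
    selected = []
    for name, values in features.items():
        x = [values[i] for i in keep]
        if abs(pearson(x, y)) >= threshold:
            selected.append(name)
    return selected


# Toy data for ten patients; all numbers and feature names are made up.
icd = [0, 0, 0, 1, 5, 6, 4, 0, 7, 2]
nlp = [0, 0, 0, 4, 3, 8, 5, 0, 9, 0]
candidates = {
    "anti_tnf_mentions": [0, 0, 1, 2, 3, 4, 2, 0, 5, 1],
    "flu_shot_mentions": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
}

labels = silver_labels(icd, nlp)
print(select_features(candidates, labels))  # prints ['anti_tnf_mentions']
```

No chart-reviewed labels are used anywhere in the selection step, which is what makes it unsupervised with respect to the gold standard; gold-standard labels would only be needed afterwards, to train the final algorithm on the reduced feature set.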
Pages: E143 - E149
Page count: 7
Related papers
44 in total
  • [1] Improving Case Definition of Crohn's Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach
    Ananthakrishnan, Ashwin N.
    Cai, Tianxi
    Savova, Guergana
    Cheng, Su-Chun
    Chen, Pei
    Perez, Raul Guzman
    Gainer, Vivian S.
    Murphy, Shawn N.
    Szolovits, Peter
    Xia, Zongqi
    Shaw, Stanley
    Churchill, Susanne
    Karlson, Elizabeth W.
    Kohane, Isaac
    Plenge, Robert M.
    Liao, Katherine P.
    [J]. INFLAMMATORY BOWEL DISEASES, 2013, 19 (07) : 1411 - 1420
  • [2] [Anonymous], J AM MED INFORM ASS
  • [3] Inaccuracy of the International Classification of Diseases (ICD-9-CM) in identifying the diagnosis of ischemic cerebrovascular disease
    Benesch, C
    Witter, DM
    Wilder, AL
    Duncan, PW
    Samsa, GP
    Matchar, DB
    [J]. NEUROLOGY, 1997, 49 (03) : 660 - 664
  • [4] Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors
    Birman-Deych, E
    Waterman, AD
    Yan, Y
    Nilasena, DS
    Radford, MJ
    Gage, BF
    [J]. MEDICAL CARE, 2005, 43 (05) : 480 - 485
  • [5] Portability of an algorithm to identify rheumatoid arthritis in electronic health records
    Carroll, Robert J.
    Thompson, Will K.
    Eyler, Anne E.
    Mandelin, Arthur M.
    Cai, Tianxi
    Zink, Raquel M.
    Pacheco, Jennifer A.
    Boomershine, Chad S.
    Lasko, Thomas A.
    Xu, Hua
    Karlson, Elizabeth W.
    Perez, Raul G.
    Gainer, Vivian S.
    Murphy, Shawn N.
    Ruderman, Eric M.
    Pope, Richard M.
    Plenge, Robert M.
    Kho, Abel Ngo
    Liao, Katherine P.
    Denny, Joshua C.
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2012, 19 (E1) : E162 - E169
  • [6] Carroll Robert J, 2011, AMIA Annu Symp Proc, V2011, P189
  • [7] Identification of subjects with polycystic ovary syndrome using electronic health records
    Castro, Victor
    Shen, Yuanyuan
    Yu, Sheng
    Finan, Sean
    Pau, Cindy Ta
    Gainer, Vivian
    Keefe, Candace C.
    Savova, Guergana
    Murphy, Shawn N.
    Cai, Tianxi
    Welt, Corrine K.
    [J]. REPRODUCTIVE BIOLOGY AND ENDOCRINOLOGY, 2015, 13
  • [8] Validation of Electronic Health Record Phenotyping of Bipolar Disorder Cases and Controls
    Castro, Victor M.
    Minnier, Jessica
    Murphy, Shawn N.
    Kohane, Isaac
    Churchill, Susanne E.
    Gainer, Vivian
    Cai, Tianxi
    Hoffnagle, Alison G.
    Dai, Yael
    Block, Stefanie
    Weill, Sydney R.
    Nadal-Vicens, Mireya
    Pollastri, Alisha R.
    Rosenquist, J. Niels
    Goryachev, Sergey
    Ongur, Dost
    Sklar, Pamela
    Perlis, Roy H.
    Smoller, Jordan W.
    [J]. AMERICAN JOURNAL OF PSYCHIATRY, 2015, 172 (04) : 363 - 372
  • [9] QT interval and antidepressant use: a cross sectional study of electronic health records
    Castro, Victor M.
    Clements, Caitlin C.
    Murphy, Shawn N.
    Gainer, Vivian S.
    Fava, Maurizio
    Weilburg, Jeffrey B.
    Erb, Jane L.
    Churchill, Susanne E.
    Kohane, Isaac S.
    Iosifescu, Dan V.
    Smoller, Jordan W.
    Perlis, Roy H.
    [J]. BMJ-BRITISH MEDICAL JOURNAL, 2013, 346
  • [10] Conway Mike, 2011, AMIA Annu Symp Proc, V2011, P274