PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology

Cited by: 28
Authors
Luo, Ling [1]
Yan, Shankai [1]
Lai, Po-Ting [1]
Veltri, Daniel [2]
Oler, Andrew [2]
Xirasagar, Sandhya [2]
Ghosh, Rajarshi [2]
Similuk, Morgan [2]
Robinson, Peter N. [3]
Lu, Zhiyong [1]
Affiliations
[1] Natl Ctr Biotechnol Informat, NLM, NIH, Bethesda, MD 20894 USA
[2] NIAID, Bioinformat & Computat Biosci Branch, Off Cyber Infrastruct & Computat Biol, NIH, Bethesda, MD USA
[3] Jackson Lab Genom Med, Farmington, CT 06032 USA
Funding
National Institutes of Health (USA);
Keywords
NORMALIZATION;
DOI
10.1093/bioinformatics/btab019
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous work on the task typically uses dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts; these can recognize more unseen concept synonyms through automatic feature learning. However, most such methods require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation.

Results: In this article, we propose PhenoTagger, a hybrid method that combines dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (an n-gram from the input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated on two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger, without requiring manually annotated training data, achieves performance competitive with state-of-the-art supervised methods.
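The pipeline described in the abstract (dictionary construction, distant supervision, n-gram candidate classification, and combination of both predictions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy ontology and its concept IDs, the function names, the `stub_classifier` standing in for the trained deep learning model, the 0.9 score threshold, and the combination rule (a dictionary match wins; otherwise a confident classifier prediction is kept).

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Toy ontology: concept ID -> names and synonyms. The IDs and terms are
# illustrative stand-ins, not an excerpt loaded from HPO.
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "C1": ["short stature", "small stature"],
    "C2": ["seizure", "seizures"],
}

Prediction = Tuple[Optional[str], float]  # (concept ID or None, score)


def build_dictionary(ontology: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lower-cased name or synonym to its concept ID."""
    return {t.lower(): cid for cid, terms in ontology.items() for t in terms}


def distant_training_set(ontology: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Label each ontology term with its own concept ID, so no manually
    annotated text is needed (distant supervision)."""
    return [(t.lower(), cid) for cid, terms in ontology.items() for t in terms]


def ngrams(tokens: List[str], max_n: int = 3) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, phrase) for every n-gram of up to max_n tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def tag(
    sentence: str,
    dictionary: Dict[str, str],
    classifier: Callable[[str], Prediction],
    threshold: float = 0.9,  # assumed cut-off for classifier predictions
) -> List[Tuple[int, int, str, float]]:
    """Combine dictionary matches with classifier predictions per n-gram."""
    tokens = sentence.lower().split()
    results = []
    for start, end, phrase in ngrams(tokens):
        if phrase in dictionary:
            # Exact dictionary match: treated as fully confident here.
            results.append((start, end, dictionary[phrase], 1.0))
            continue
        label, score = classifier(phrase)
        if label is not None and score >= threshold:
            results.append((start, end, label, score))
    return sorted(results)


def stub_classifier(phrase: str) -> Prediction:
    """Hypothetical stand-in for a model trained on distant_training_set();
    it pretends the model has learned one unseen synonym of C1."""
    if phrase == "reduced height":
        return "C1", 0.95
    return None, 0.0


dictionary = build_dictionary(TOY_ONTOLOGY)
print(distant_training_set(TOY_ONTOLOGY))
print(tag("The patient has seizures and reduced height", dictionary, stub_classifier))
```

In this toy run, "seizures" is found by the dictionary, while "reduced height", which is absent from the dictionary, is recovered only through the stand-in classifier. That mirrors the precision and recall trade-off the abstract attributes to the two families of methods.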
Pages: 1884-1890
Page count: 7