A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach

Cited by: 29

Authors:

Xing, Wenhui [1]

Qi, Junsheng [2]

Yuan, Xiaohui [1]

Li, Lin [1]

Zhang, Xiaoyu [3]

Fu, Yuhua [1]

Xiong, Shengwu [1]

Hu, Lun [1]

Peng, Jing [1]

Affiliations:

[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Hubei, Peoples R China

[2] China Agr Univ, Dept Plant Sci, Coll Biol Sci, Beijing 100193, Peoples R China

[3] Huazhong Univ Sci & Technol, Britton Chance Ctr Biomed Photon, Wuhan Natl Lab Optoelect, Wuhan 430074, Hubei, Peoples R China

Source:

BIOINFORMATICS | 2018 / Vol. 34 / Issue 13

Funding:

National Natural Science Foundation of China;

Keywords:

TEXT; SYSTEM; DATABASE; DISEASE;

DOI:

10.1093/bioinformatics/bty263

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification code:

071010 ; 081704 ;

Abstract:

Motivation: A fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, many of which are reported only in the large body of published literature. Because the lexical features of genes are relatively regular in text, the main challenge in extracting these relations is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.

Results: We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach as the representation learning method for recognizing the broad variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated OLLIE, a well-known information extraction system, to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we ran two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the manually curated TAIR gene-phenotype relationship dataset to test its validity. The results showed that the proposed pipeline covers 70.94% of the original dataset and adds 373 new relations that expand it.

Pages: 386-394

Page count: 9

