A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach

Times cited: 29
Authors
Xing, Wenhui [1]
Qi, Junsheng [2]
Yuan, Xiaohui [1]
Li, Lin [1]
Zhang, Xiaoyu [3]
Fu, Yuhua [1]
Xiong, Shengwu [1]
Hu, Lun [1]
Peng, Jing [1]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Hubei, Peoples R China
[2] China Agr Univ, Dept Plant Sci, Coll Biol Sci, Beijing 100193, Peoples R China
[3] Huazhong Univ Sci & Technol, Britton Chance Ctr Biomed Photon, Wuhan Natl Lab Optoelect, Wuhan 430074, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
TEXT; SYSTEM; DATABASE; DISEASE;
DOI
10.1093/bioinformatics/bty263
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: A fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants. Results: We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach as representation learning to recognize the wide variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated OLLIE, a well-known information extraction system, to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we set up two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR manual gene-phenotype relationship dataset to validate it. The results showed that the proposed pipeline covers 70.94% of the original dataset and adds 373 new relations to expand it.
Pages: 386-394
Number of pages: 9