nala: text mining natural language mutation mentions

Cited by: 12
Authors
Cejuela, Juan Miguel [1 ,2 ]
Bojchevski, Aleksandar [1 ,2 ]
Uhlig, Carsten [1 ]
Bekmukhametov, Rustem [1 ,3 ]
Karn, Sanjeev Kumar [1 ,4 ,5 ]
Mahmuti, Shpend [1 ]
Baghudana, Ashish [1 ,6 ]
Dubey, Ankit [1 ,7 ]
Satagopam, Venkata P. [8 ]
Rost, Burkhard [1 ,9 ,10 ,11 ,12 ]
Affiliations
[1] TUM, Dept Informat Bioinformat & Computat Biol i12, D-85748 Munich, Germany
[2] TUM Grad Sch, Ctr Doctoral Studies Informat & Its Applicat CeDo, D-85748 Garching, Germany
[3] Microsoft, Bellevue, WA 98008 USA
[4] Ludwig Maximilians Univ Munchen, D-80538 Munich, Germany
[5] Siemens AG, Corp Technol, D-81739 Munich, Germany
[6] BITS Pilani, KK Birla Goa Campus, Sancoale 403726, Goa, India
[7] Concur Germany GmbH, D-60528 Frankfurt, Germany
[8] Univ Luxembourg, LCSB, L-4367 Belvaux, Luxembourg
[9] Columbia Univ, Inst Adv Study TUM IAS, New York, NY 10032 USA
[10] Columbia Univ, Inst Food & Plant Sci WZW Weihenstephan, New York, NY 10032 USA
[11] Columbia Univ, New York Consortium Membrane Prot Struct NYCOMPS, New York, NY 10032 USA
[12] Columbia Univ, Dept Biochem & Mol Biophys, New York, NY 10032 USA
Keywords
SEQUENCE VARIANTS; DATABASE;
DOI
10.1093/bioinformatics/btx083
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant mentions in natural language (NL) largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6'). Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28-77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only.
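To make the ST/NL distinction in the abstract concrete, the following is a minimal sketch of a pattern-based tagger for simple ST substitutions such as 'E6V'. It is an illustration only: the regular expression, the function name `find_st_mentions`, and the first example sentence are assumptions introduced here, and they do not represent the implementation of nala, SETH, or tmVar. The point is that a surface pattern of this kind finds the ST form but returns nothing for the NL paraphrase quoted in the abstract.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern (an assumption for illustration): wild-type residue,
# position, mutant residue, e.g. 'E6V'. Real ST taggers cover many more forms.
ST_SUBSTITUTION = re.compile(
    rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b"
)


def find_st_mentions(text):
    """Return (wild_type, position, mutant) tuples for simple ST mentions."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in ST_SUBSTITUTION.finditer(text)
    ]


# Made-up sentence containing the ST mention from the abstract.
st_sentence = "The E6V variant was reported in the patient."
# NL example quoted in the abstract.
nl_sentence = "glutamic acid was substituted by valine at residue 6"

st_hits = find_st_mentions(st_sentence)
nl_hits = find_st_mentions(nl_sentence)
```

Here `st_hits` holds the single parsed mention, while `nl_hits` is empty, which is the gap the abstract says a learned sequence model is meant to close.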
Pages: 1852-1858
Page count: 7
Related papers
31 in total
[1]  
[Anonymous], 2016, International Journal of Software Engineering and Its Applications, DOI 10.14257/ijseia.2016.10.2.08
[2]  
[Anonymous], 2004, P INT JOINT WORKSH N
[3]  
[Anonymous], 2001, CONDITIONAL RANDOM F
[4]   UniProt: a hub for protein information [J].
Bateman, Alex ;
Martin, Maria Jesus ;
O'Donovan, Claire ;
Magrane, Michele ;
Apweiler, Rolf ;
Alpi, Emanuele ;
Antunes, Ricardo ;
Arganiska, Joanna ;
Bely, Benoit ;
Bingley, Mark ;
Bonilla, Carlos ;
Britto, Ramona ;
Bursteinas, Borisas ;
Chavali, Gayatri ;
Cibrian-Uhalte, Elena ;
Da Silva, Alan ;
De Giorgi, Maurizio ;
Dogan, Tunca ;
Fazzini, Francesco ;
Gane, Paul ;
Castro, Leyla Garcia ;
Garmiri, Penelope ;
Hatton-Ellis, Emma ;
Hieta, Reija ;
Huntley, Rachael ;
Legge, Duncan ;
Liu, Wudong ;
Luo, Jie ;
MacDougall, Alistair ;
Mutowo, Prudence ;
Nightingale, Andrew ;
Orchard, Sandra ;
Pichler, Klemens ;
Poggioli, Diego ;
Pundir, Sangya ;
Pureza, Luis ;
Qi, Guoying ;
Rosanoff, Steven ;
Saidi, Rabie ;
Sawford, Tony ;
Shypitsyna, Aleksandra ;
Turner, Edward ;
Volynkin, Vladimir ;
Wardell, Tony ;
Watkins, Xavier ;
Zellner, Hermann ;
Cowley, Andrew ;
Figueira, Luis ;
Li, Weizhong ;
McWilliam, Hamish .
NUCLEIC ACIDS RESEARCH, 2015, 43 (D1) :D204-D212
[5]  
Boutet E, 2016, METHODS MOL BIOL, V1374, P23, DOI 10.1007/978-1-4939-3167-5_2
[6]  
Caporaso J.G., 2007, BIOCOMPUTING 2008
[7]   MutationFinder: a high-performance system for extracting point mutation mentions from text [J].
Caporaso, J. Gregory ;
Baumgartner, William A., Jr. ;
Randolph, David A. ;
Cohen, K. Bretonnel ;
Hunter, Lawrence .
BIOINFORMATICS, 2007, 23 (14) :1862-1865
[8]   tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles [J].
Cejuela, Juan Miguel ;
McQuilton, Peter ;
Ponting, Laura ;
Marygold, Steven J. ;
Stefancsik, Raymund ;
Millburn, Gillian H. ;
Rost, Burkhard .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2014,
[9]   The HIV Mutation Browser: A Resource for Human Immunodeficiency Virus Mutagenesis and Polymorphism Data [J].
Davey, Norman E. ;
Satagopam, Venkata P. ;
Santiago-Mozos, Salvador ;
Villacorta-Martin, Carlos ;
Bharat, Tanmay A. M. ;
Schneider, Reinhard ;
Briggs, John A. G. .
PLOS COMPUTATIONAL BIOLOGY, 2014, 10 (12)
[10]   HGVS Recommendations for the Description of Sequence Variants: 2016 Update [J].
den Dunnen, Johan T. ;
Dalgleish, Raymond ;
Maglott, Donna R. ;
Hart, Reece K. ;
Greenblatt, Marc S. ;
McGowan-Jordan, Jean ;
Roux, Anne-Francoise ;
Smith, Timothy ;
Antonarakis, Stylianos E. ;
Taschner, Peter E. M. .
HUMAN MUTATION, 2016, 37 (06) :564-569