The CHEMDNER corpus of chemicals and drugs and its annotation principles

Cited by: 176
Authors
Krallinger, Martin [1 ]
Rabal, Obdulia [2 ]
Leitner, Florian [3 ]
Vazquez, Miguel [1 ]
Salgado, David [4 ]
Lu, Zhiyong [5 ]
Leaman, Robert [5 ]
Lu, Yanan [6 ]
Ji, Donghong [6 ]
Lowe, Daniel M. [7 ]
Sayle, Roger A. [7 ]
Batista-Navarro, Riza Theresa [8 ]
Rak, Rafal [8 ]
Huber, Torsten [9 ]
Rocktaschel, Tim [10 ]
Matos, Sergio [11 ]
Campos, David [11 ]
Tang, Buzhou [12 ]
Xu, Hua [13 ]
Munkhdalai, Tsendsuren [14 ]
Ryu, Keun Ho [14 ]
Ramanan, S. V. [15 ]
Nathan, Senthil [15 ]
Zitnik, Slavko [16 ]
Bajec, Marko [16 ]
Weber, Lutz [17 ]
Irmer, Matthias [17 ]
Akhondi, Saber A. [18 ]
Kors, Jan A. [18 ]
Xu, Shuo [19 ]
An, Xin [20 ]
Sikdar, Utpal Kumar [21 ]
Ekbal, Asif [21 ]
Yoshioka, Masaharu [22 ]
Dieb, Thaer M. [22 ]
Choi, Miji [23 ]
Verspoor, Karin [23 ,24 ]
Khabsa, Madian [25 ]
Giles, C. Lee [25 ,26 ]
Liu, Hongfang [27 ]
Ravikumar, Komandur Elayavilli [27 ]
Lamurias, Andre [28 ]
Couto, Francisco M. [28 ]
Dai, Hong-Jie [29 ]
Tsai, Richard Tzong-Han [30 ]
Ata, Caglar [31 ]
Can, Tolga [31 ]
Usie, Anabel [32 ,33 ]
Alves, Rui [32 ]
Segura-Bedmar, Isabel [34 ]
Affiliations
[1] Spanish Natl Canc Res Ctr, Struct Biol & BioComp Programme, Struct Computat Biol Grp, Madrid, Spain
[2] Univ Navarra, Ctr Appl Med Res CIMA, Mol Therapeut Program, Small Mol Discovery Platform, Pamplona, Spain
[3] Univ Politecn Madrid, Dept Artificial Intelligence, Computat Intelligence Grp, Madrid, Spain
[4] Fac Med Timone, Marseille, France
[5] NIH, Natl Ctr Biotechnol Informat NCBI, Bethesda, MD USA
[6] Wuhan Univ, Nat Language Proc Lab, Wuhan, Hubei, Peoples R China
[7] NextMove Software Ltd, Innovat Ctr, Unit 23, Cambridge, England
[8] Manchester Inst Biotechnol, Natl Ctr Text Min, Manchester, Lancs, England
[9] Humboldt Univ, Knowledge Management Bioinformat, Berlin, Germany
[10] UCL, Dept Comp Sci, London, England
[11] Univ Aveiro, IEETA DETI, Aveiro, Portugal
[12] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Shenzhen, GuangDong, Peoples R China
[13] Univ Texas Houston, Sch Biomed Informat, Hlth Sci Ctr, Houston, TX USA
[14] Chungbuk Natl Univ, Sch Elect & Comp Engn, Database Bioinformat Lab, Cheongju, South Korea
[15] IIT Madras Res Pk, RelAgent Pvt Ltd, Madras, Tamil Nadu, India
[16] Univ Ljubljana, Fac Comp & Informat Sci, Ljubljana, Slovenia
[17] OntoChem GmbH, Halle, Germany
[18] Erasmus Univ, Med Ctr, Dept Med Informat, Rotterdam, Netherlands
[19] Inst Sci & Tech Informat China, Informat Technol Supporting Ctr, Beijing, Peoples R China
[20] Beijing Forestry Univ, Sch Econ & Management, Beijing, Peoples R China
[21] Indian Inst Technol, Dept Comp Sci & Engn, Patna, Bihar, India
[22] Hokkaido Univ, Grad Sch Informat Sci & Technol, Sapporo, Hokkaido, Japan
[23] Univ Melbourne, Dept Comp & Informat Syst, Melbourne, Vic, Australia
[24] Natl ICT Australia Victoria Res Lab, West Melbourne, Australia
[25] Penn State Univ, Comp Sci & Engn, University Pk, PA 16802 USA
[26] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA
[27] Mayo Coll Med, Dept Hlth Sci Res, Rochester, MN USA
[28] Univ Lisbon, Fac Sci, Dept Informat, LaSIGE, Lisbon, Portugal
[29] Taipei Med Univ, Coll Med Sci & Technol, Grad Inst BioMed Informat, Taipei, Taiwan
[30] Natl Cent Univ, Dept Comp Sci & Informat Engn, Taoyuan, Taiwan
[31] Middle East Tech Univ, Dept Comp Engn, Ankara, Turkey
[32] Univ Lleida, Dept Ciencias Med Basiques, Lleida, Spain
[33] Univ Lleida, Dept Informat & Engn Ind, Lleida, Spain
[34] Univ Carlos III Madrid, Comp Sci Dept, Madrid, Spain
Source
JOURNAL OF CHEMINFORMATICS | 2015 / Vol. 7
Keywords
NAMED ENTITY RECOGNITION; TEXT; EXTRACTION;
DOI
10.1186/1758-2946-7-S1-S2
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
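As an illustration of the annotation scheme described in the abstract, the following minimal Python sketch represents a chemical mention with one of the seven SACEM classes and computes a simple percentage agreement between two annotators. The `Mention` fields, the upper-case class labels, and the `percentage_agreement` function are hypothetical names chosen for this example. The agreement formula (shared mentions divided by the union of both annotators' mentions) is an assumption, since the abstract does not state how the reported figure was calculated.

```python
from dataclasses import dataclass

# The seven SACEM classes named in the abstract (label spelling is an assumption).
SACEM_CLASSES = frozenset({
    "ABBREVIATION", "FAMILY", "FORMULA", "IDENTIFIER",
    "MULTIPLE", "SYSTEMATIC", "TRIVIAL",
})


@dataclass(frozen=True)
class Mention:
    """One chemical entity mention (hypothetical record layout)."""
    doc_id: str       # e.g. a PubMed identifier
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    sacem_class: str  # one of SACEM_CLASSES

    def __post_init__(self):
        if self.sacem_class not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {self.sacem_class}")
        if not 0 <= self.start < self.end:
            raise ValueError("offsets must satisfy 0 <= start < end")


def percentage_agreement(a, b, match_class=False):
    """Percentage of mentions shared by both annotators.

    Assumed definition: |A & B| / |A | B| * 100, where mentions match on
    document and exact offsets, and optionally also on SACEM class.
    """
    if match_class:
        key = lambda m: (m.doc_id, m.start, m.end, m.sacem_class)
    else:
        key = lambda m: (m.doc_id, m.start, m.end)
    set_a = {key(m) for m in a}
    set_b = {key(m) for m in b}
    union = set_a | set_b
    if not union:
        return 100.0  # nothing annotated by either: trivially in agreement
    return 100.0 * len(set_a & set_b) / len(union)


# Toy annotations, invented for illustration only.
annotator_1 = [
    Mention("doc1", 0, 7, "TRIVIAL"),
    Mention("doc1", 20, 24, "FORMULA"),
]
annotator_2 = [
    Mention("doc1", 0, 7, "TRIVIAL"),
    Mention("doc1", 20, 24, "ABBREVIATION"),
    Mention("doc1", 30, 38, "FAMILY"),
]

span_agreement = percentage_agreement(annotator_1, annotator_2)
class_agreement = percentage_agreement(annotator_1, annotator_2, match_class=True)
```

Matching on offsets alone measures whether annotators found the same mentions; setting `match_class=True` also requires the same SACEM class, which is a stricter criterion and gives a lower or equal score.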
Pages: 17