The CHEMDNER corpus of chemicals and drugs and its annotation principles

Cited by: 176

Authors:

Krallinger, Martin ^{[1]}

Rabal, Obdulia ^{[2]}

Leitner, Florian ^{[3]}

Vazquez, Miguel ^{[1]}

Salgado, David ^{[4]}

Lu, Zhiyong ^{[5]}

Leaman, Robert ^{[5]}

Lu, Yanan ^{[6]}

Ji, Donghong ^{[6]}

Lowe, Daniel M. ^{[7]}

Sayle, Roger A. ^{[7]}

Batista-Navarro, Riza Theresa ^{[8]}

Rak, Rafal ^{[8]}

Huber, Torsten ^{[9]}

Rocktaschel, Tim ^{[10]}

Matos, Sergio ^{[11]}

Campos, David ^{[11]}

Tang, Buzhou ^{[12]}

Xu, Hua ^{[13]}

Munkhdalai, Tsendsuren ^{[14]}

Ryu, Keun Ho ^{[14]}

Ramanan, S. V. ^{[15]}

Nathan, Senthil ^{[15]}

Zitnik, Slavko ^{[16]}

Bajec, Marko ^{[16]}

Weber, Lutz ^{[17]}

Irmer, Matthias ^{[17]}

Akhondi, Saber A. ^{[18]}

Kors, Jan A. ^{[18]}

Xu, Shuo ^{[19]}

An, Xin ^{[20]}

Sikdar, Utpal Kumar ^{[21]}

Ekbal, Asif ^{[21]}

Yoshioka, Masaharu ^{[22]}

Dieb, Thaer M. ^{[22]}

Choi, Miji ^{[23]}

Verspoor, Karin ^{[23,24]}

Khabsa, Madian ^{[25]}

Giles, C. Lee ^{[25,26]}

Liu, Hongfang ^{[27]}

Ravikumar, Komandur Elayavilli ^{[27]}

Lamurias, Andre ^{[28]}

Couto, Francisco M. ^{[28]}

Dai, Hong-Jie ^{[29]}

Tsai, Richard Tzong-Han ^{[30]}

Ata, Caglar ^{[31]}

Can, Tolga ^{[31]}

Usie, Anabel ^{[32,33]}

Alves, Rui ^{[32]}

Segura-Bedmar, Isabel ^{[34]}

Affiliations:

[1] Spanish Natl Canc Res Ctr, Struct Biol & BioComp Programme, Struct Computat Biol Grp, Madrid, Spain

[2] Univ Navarra, Ctr Appl Med Res CIMA, Mol Therapeut Program, Small Mol Discovery Platform, Pamplona, Spain

[3] Univ Politecn Madrid, Dept Artificial Intelligence, Computat Intelligence Grp, Madrid, Spain

[4] Fac Med Timone, Marseille, France

[5] NIH, Natl Ctr Biotechnol Informat NCBI, Bethesda, MD USA

[6] Wuhan Univ, Nat Language Proc Lab, Wuhan, Hubei, Peoples R China

[7] NextMove Software Ltd, Innovat Ctr, Unit 23, Cambridge, England

[8] Manchester Inst Biotechnol, Natl Ctr Text Min, Manchester, Lancs, England

[9] Humboldt Univ, Knowledge Management Bioinformat, Berlin, Germany

[10] UCL, Dept Comp Sci, London, England

[11] Univ Aveiro, IEETA DETI, Aveiro, Portugal

[12] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Shenzhen, GuangDong, Peoples R China

[13] Univ Texas Houston, Sch Biomed Informat, Hlth Sci Ctr, Houston, TX USA

[14] Chungbuk Natl Univ, Sch Elect & Comp Engn, Database Bioinformat Lab, Cheongju, South Korea

[15] IIT Madras Res Pk, RelAgent Pvt Ltd, Madras, Tamil Nadu, India

[16] Univ Ljubljana, Fac Comp & Informat Sci, Ljubljana, Slovenia

[17] OntoChem GmbH, Halle, Germany

[18] Erasmus Univ, Med Ctr, Dept Med Informat, Rotterdam, Netherlands

[19] Inst Sci & Tech Informat China, Informat Technol Supporting Ctr, Beijing, Peoples R China

[20] Beijing Forestry Univ, Sch Econ & Management, Beijing, Peoples R China

[21] Indian Inst Technol, Dept Comp Sci & Engn, Patna, Bihar, India

[22] Hokkaido Univ, Grad Sch Informat Sci & Technol, Sapporo, Hokkaido, Japan

[23] Univ Melbourne, Dept Comp & Informat Syst, Melbourne, Vic, Australia

[24] Natl ICT Australia Victoria Res Lab, West Melbourne, Australia

[25] Penn State Univ, Comp Sci & Engn, University Pk, PA 16802 USA

[26] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA

[27] Mayo Coll Med, Dept Hlth Sci Res, Rochester, MN USA

[28] Univ Lisbon, Fac Sci, Dept Informat, LaSIGE, Lisbon, Portugal

[29] Taipei Med Univ, Coll Med Sci & Technol, Grad Inst BioMed Informat, Taipei, Taiwan

[30] Natl Cent Univ, Dept Comp Sci & Informat Engn, Taoyuan, Taiwan

[31] Middle East Tech Univ, Dept Comp Engn, Ankara, Turkey

[32] Univ Lleida, Dept Ciencias Med Basiques, Lleida, Spain

[33] Univ Lleida, Dept Informat & Engn Ind, Lleida, Spain

[34] Univ Carlos III Madrid, Comp Sci Dept, Madrid, Spain

Source:

JOURNAL OF CHEMINFORMATICS | 2015 / Vol. 7

Keywords:

NAMED ENTITY RECOGNITION; TEXT; EXTRACTION

DOI:

10.1186/1758-2946-7-S1-S2

Chinese Library Classification (CLC) number:

O6 [Chemistry]

Subject classification code:

0703

Abstract:

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
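
As a rough illustration of how the seven SACEM classes named in the abstract might be used, the sketch below tallies mentions per class from tab-separated annotation rows. The six-column layout (document id, section, start offset, end offset, mention text, class label), the function name `count_by_class`, and the sample rows are all assumptions made for this example; the actual file layout should be checked against the corpus documentation at the URL above.

```python
from collections import Counter

# The seven SACEM classes named in the abstract.
SACEM_CLASSES = (
    "ABBREVIATION", "FAMILY", "FORMULA", "IDENTIFIER",
    "MULTIPLE", "SYSTEMATIC", "TRIVIAL",
)


def count_by_class(lines):
    """Tally mentions per SACEM class from tab-separated annotation rows.

    Assumed (hypothetical) column order: document id, section,
    start offset, end offset, mention text, class label.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank rows
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(
                f"expected 6 tab-separated fields, got {len(fields)}"
            )
        label = fields[5].upper()
        if label not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {label!r}")
        counts[label] += 1
    return counts


# Made-up sample rows for illustration only; not taken from the corpus.
sample = [
    "doc1\tT\t0\t7\taspirin\tTRIVIAL",
    "doc1\tA\t10\t14\tNaCl\tFORMULA",
    "doc2\tA\t5\t8\tATP\tABBREVIATION",
    "doc2\tA\t20\t27\tethanol\tTRIVIAL",
]
counts = count_by_class(sample)
print(counts.most_common())
```

The sketch rejects any label outside the seven classes listed in the abstract; a real annotation file may carry additional labels, in which case that check would need to be relaxed.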

Pages: 17
