BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Times cited: 533
Authors
Li, Jiao [1]
Sun, Yueping [1]
Johnson, Robin J. [2, 3]
Sciaky, Daniela [2, 3]
Wei, Chih-Hsuan [4]
Leaman, Robert [4]
Davis, Allan Peter [2, 3]
Mattingly, Carolyn J. [2, 3]
Wiegers, Thomas C. [2, 3]
Lu, Zhiyong [4]
Affiliations
[1] Chinese Acad Med Sci, Inst Med Informat, Beijing 100020, Peoples R China
[2] North Carolina State Univ, Dept Biol Sci, Raleigh, NC 27695 USA
[3] North Carolina State Univ, Ctr Human Hlth & Environm, Raleigh, NC 27695 USA
[4] Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2016
Funding
National Institutes of Health (USA)
Keywords
COMPARATIVE TOXICOGENOMICS DATABASE; RECOGNITION; ANNOTATION; ARTICLES; TOOL;
DOI
10.1093/database/baw068
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus, called BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first annotated independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores in the test set, measured with the Jaccard similarity coefficient, were 87.49% for diseases and 96.05% for chemicals. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
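As an illustration of the agreement measure named in the abstract, the Jaccard similarity coefficient between two annotators can be sketched as below. This is a minimal sketch and not the authors' evaluation script: it assumes each annotation is a (start offset, end offset, concept identifier) tuple and that two annotations agree only on an exact match. The function name `jaccard_agreement`, the treatment of two empty sets, and the sample annotations are all hypothetical.

```python
def jaccard_agreement(annotations_a, annotations_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of annotations."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty annotation sets count as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical annotations: (start offset, end offset, illustrative concept ID).
annotator_1 = {(0, 8, "D008694"), (25, 34, "D011605"), (40, 52, "D006973")}
annotator_2 = {(0, 8, "D008694"), (25, 34, "D011605"), (60, 70, "D003866")}

score = jaccard_agreement(annotator_1, annotator_2)
print(f"IAA (Jaccard): {score:.2%}")
```

Here the two annotators share two of four distinct annotations, so the score is 0.5. A looser matching rule (for example, overlapping spans) would need a different comparison than plain set intersection.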
Pages: 10