NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature

Cited by: 45
Authors
Islamaj, Rezarta [1]
Leaman, Robert [1]
Kim, Sun [1]
Kwon, Dongseop [1]
Wei, Chih-Hsuan [1]
Comeau, Donald C. [1]
Peng, Yifan [1]
Cissel, David [1]
Coss, Cathleen [1]
Fisher, Carol [1]
Guzman, Rob [1]
Kochar, Preeti Gokal [1]
Koppel, Stella [1]
Trinh, Dorothy [1]
Sekiya, Keiko [1]
Ward, Janice [1]
Whitman, Deborah [1]
Schmidt, Susan [1]
Lu, Zhiyong [1]
Affiliations
[1] NIH, Natl Lib Med, Bethesda, MD 20894 USA
Funding
National Institutes of Health (NIH), USA;
Keywords
DATABASE; CORPUS; DRUGS;
DOI
10.1038/s41597-021-00875-1
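Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes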
07 ; 0710 ; 09 ;
Abstract
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
Pages: 12
Related papers
32 in total
[1]   Automatic identification of relevant chemical compounds from patents [J].
Akhondi, Saber A. ;
Rey, Hinnerk ;
Schwoerer, Markus ;
Maier, Michael ;
Toomey, John ;
Nau, Heike ;
Ilchmann, Gabriele ;
Sheehan, Mark ;
Irmer, Matthias ;
Bobach, Claudia ;
Doornenbal, Marius ;
Gregory, Michelle ;
Kors, Jan A. .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2019,
[2]   Concept annotation in the CRAFT corpus [J].
Bada, Michael ;
Eckert, Miriam ;
Evans, Donald ;
Garcia, Kristin ;
Shipley, Krista ;
Sitnikov, Dmitry ;
Baumgartner, William A., Jr. ;
Cohen, K. Bretonnel ;
Verspoor, Karin ;
Blake, Judith A. ;
Hunter, Lawrence E. .
BMC BIOINFORMATICS, 2012, 13
[3]   The Unified Medical Language System (UMLS): integrating biomedical terminology [J].
Bodenreider, O .
NUCLEIC ACIDS RESEARCH, 2004, 32 :D267-D270
[4]  
Comeau D.C., BIOC API PMC
[5]   PMC text mining subset in BioC: about three million full-text articles and growing [J].
Comeau, Donald C. ;
Wei, Chih-Hsuan ;
Dogan, Rezarta Islamaj ;
Lu, Zhiyong .
BIOINFORMATICS, 2019, 35 (18) :3533-3535
[6]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[7]   Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine [J].
Dogan, Rezarta Islamaj ;
Kim, Sun ;
Chatr-aryamontri, Andrew ;
Wei, Chih-Hsuan ;
Comeau, Donald C. ;
Antunes, Rui ;
Matos, Sergio ;
Chen, Qingyu ;
Elangovan, Aparna ;
Panyam, Nagesh C. ;
Verspoor, Karin ;
Liu, Hongfang ;
Wang, Yanshan ;
Liu, Zhuang ;
Altinel, Berna ;
Husunbeyi, Zehra Melce ;
Ozgur, Arzucan ;
Fergadis, Aris ;
Wang, Chen-Kai ;
Dai, Hong-Jie ;
Tran, Tung ;
Kavuluru, Ramakanth ;
Luo, Ling ;
Steppi, Albert ;
Zhang, Jinfeng ;
Qu, Jinchan ;
Lu, Zhiyong .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2019,
[8]   The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions [J].
Dogan, Rezarta Islamaj ;
Kim, Sun ;
Chatr-aryamontri, Andrew ;
Chang, Christie S. ;
Oughtred, Rose ;
Rust, Jennifer ;
Wilbur, W. John ;
Comeau, Donald C. ;
Dolinski, Kara ;
Tyers, Mike .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2017,
[9]   NCBI disease corpus: A resource for disease name recognition and concept normalization [J].
Dogan, Rezarta Islamaj ;
Leaman, Robert ;
Lu, Zhiyong .
JOURNAL OF BIOMEDICAL INFORMATICS, 2014, 47 :1-10
[10]   Understanding PubMed® user search behavior through log analysis [J].
Dogan, Rezarta Islamaj ;
Murray, G. Craig ;
Neveol, Aurelie ;
Lu, Zhiyong .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2009,