Recognizing software names in biomedical literature using machine learning

Cited by: 7
Authors
Wei, Qiang [1]
Zhang, Yaoyun [1]
Amith, Muhammad [1]
Lin, Rebecca [2]
Lapeyrolerie, Jenay [3]
Tao, Cui [1]
Xu, Hua [1]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, Houston, TX 77030 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] Baylor Univ, Waco, TX 76798 USA
Keywords
biomedical literature; biomedical software; biomedical software index; named entity recognition; natural language processing; BIOINFORMATICS; SERVICES;
DOI
10.1177/1460458219869490
Chinese Library Classification (CLC) number
R19 [Health organizations and services (health services administration)];
Abstract
Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are mainly built using manual curation, which is time-consuming and unscalable. This study took the initiative to manually annotate software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Specifically, two strategies were proposed for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features of clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then created a biomedical software catalog with 19,557 entries using the developed system. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
Pages: 21-33
Page count: 13
Related papers
18 entries in total
[1]   Knowledge-Based Approach for Named Entity Recognition in Biomedical Literature: A Use Case in Biomedical Software Identification [J].
Amith, Muhammad ;
Zhang, Yaoyun ;
Xu, Hua ;
Tao, Cui .
ADVANCES IN ARTIFICIAL INTELLIGENCE: FROM THEORY TO PRACTICE (IEA/AIE 2017), PT II, 2017, 10351 :386-395
[2]   BioCatalogue: a universal catalogue of web services for the life sciences [J].
Bhagat, Jiten ;
Tanoh, Franck ;
Nzuobontane, Eric ;
Laurent, Thomas ;
Orlowski, Jerzy ;
Roos, Marco ;
Wolstencroft, Katy ;
Aleksejevs, Sergejs ;
Stevens, Robert ;
Pettifer, Steve ;
Lopez, Rodrigo ;
Goble, Carole A. .
NUCLEIC ACIDS RESEARCH, 2010, 38 :W689-W694
[3]   Named entity recognition with multiple segment representations [J].
Cho, Han-Cheol ;
Okazaki, Naoaki ;
Miwa, Makoto ;
Tsujii, Jun'ichi .
INFORMATION PROCESSING & MANAGEMENT, 2013, 49 (04) :954-965
[4]  
Collobert R., 2008, P 25 INT C MACH LEAR, P160, DOI 10.1145/1390156.1390177
[5]   A Survey of Bioinformatics Database and Software Usage through Mining the Literature [J].
Duck, Geraint ;
Nenadic, Goran ;
Filannino, Michele ;
Brass, Andy ;
Robertson, David L. ;
Stevens, Robert .
PLOS ONE, 2016, 11 (06)
[6]   bioNerDS: exploring bioinformatics' database and software use through literature mining [J].
Duck, Geraint ;
Nenadic, Goran ;
Brass, Andy ;
Robertson, David L. ;
Stevens, Robert .
BMC BIOINFORMATICS, 2013, 14
[7]   Bioconductor: open software development for computational biology and bioinformatics [J].
Gentleman, RC ;
Carey, VJ ;
Bates, DM ;
Bolstad, B ;
Dettling, M ;
Dudoit, S ;
Ellis, B ;
Gautier, L ;
Ge, YC ;
Gentry, J ;
Hornik, K ;
Hothorn, T ;
Huber, W ;
Iacus, S ;
Irizarry, R ;
Leisch, F ;
Li, C ;
Maechler, M ;
Rossini, AJ ;
Sawitzki, G ;
Smith, C ;
Smyth, G ;
Tierney, L ;
Yang, JYH ;
Zhang, JH .
GENOME BIOLOGY, 2004, 5 (10)
[8]   BioJS: an open source JavaScript framework for biological data visualization [J].
Gomez, John ;
Garcia, Leyla J. ;
Salazar, Gustavo A. ;
Villaveces, Jose ;
Gore, Swanand ;
Garcia, Alexander ;
Martin, Maria J. ;
Launay, Guillaume ;
Alcantara, Rafael ;
del-Toro, Noemi ;
Dumousseau, Marine ;
Orchard, Sandra ;
Velankar, Sameer ;
Hermjakob, Henning ;
Zong, Chenggong ;
Ping, Peipei ;
Corpas, Manuel ;
Jimenez, Rafael C. .
BIOINFORMATICS, 2013, 29 (08) :1103-1104
[9]  
Guo J., 2014, P 2014 C EMP METH NA, P110
[10]   OMICtools: an informative directory for multi-omic data analysis [J].
Henry, Vincent J. ;
Bandrowski, Anita E. ;
Pepin, Anne-Sophie ;
Gonzalez, Bruno J. ;
Desfeux, Arnaud .
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2014,