A Survey of Bioinformatics Database and Software Usage through Mining the Literature

Cited by: 41
Authors
Duck, Geraint [1]
Nenadic, Goran [1, 2]
Filannino, Michele [1]
Brass, Andy [1]
Robertson, David L. [3]
Stevens, Robert [1]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
[2] Univ Manchester, Manchester Inst Biotechnol, Manchester, Lancs, England
[3] Univ Manchester, Fac Life Sci, Computat & Evolutionary Biol, Manchester, Lancs, England
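Funding
Engineering and Physical Sciences Research Council (EPSRC), UK; Biotechnology and Biological Sciences Research Council (BBSRC), UK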
Keywords
DOI
10.1371/journal.pone.0157989
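Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09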
Abstract
Computer-based resources are central to much, if not most, biological and medical research. However, while there is an ever-expanding choice of bioinformatics resources to use, described within the biomedical literature, little work to date has provided an evaluation of the full range of availability or levels of usage of database and software resources. Here we use text mining to process the PubMed Central full-text corpus, identifying mentions of databases or software within the scientific literature. We provide an audit of the resources contained within the biomedical literature, and a comparison of their relative usage, both over time and between the sub-disciplines of bioinformatics, biology and medicine. We find that trends in resource usage differ between these domains. The bioinformatics literature emphasises novel resource development, while database and software usage within biology and medicine is more stable and conservative. Many resources are only mentioned in the bioinformatics literature, with a relatively small number making it out into general biology, and fewer still into the medical literature. In addition, many resources are seeing a steady decline in their usage (e.g., BLAST, SWISS-PROT), though some are instead seeing rapid growth (e.g., the GO, R). We find a striking imbalance in resource usage, with the top 5% of resource names (133 names) accounting for 47% of total usage, and over 70% of the resources extracted being mentioned only once each. While these results highlight the dynamic and creative nature of bioinformatics research, they raise questions about software reuse, choice and the sharing of bioinformatics practice. Is it acceptable that so many resources are apparently never reused? Finally, our work is a step towards automated extraction of scientific method from text. We make the dataset generated by our study available under the CC0 license.
Pages: 25