Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Cited by: 27
Authors
Brown, Peter [1]
Tan, Aik-Choon [3]
El-Esawi, Mohamed A. [4]
Liehr, Thomas [5]
Blanck, Oliver [6]
Gladue, Douglas P. [7]
Almeida, Gabriel M. F. [8]
Cernava, Tomislav [9]
Sorzano, Carlos O. [10]
Yeung, Andy W. K. [11]
Engel, Michael S. [12]
Chandrasekaran, Arun Richard [13]
Muth, Thilo [14]
Staege, Martin S. [15]
Daulatabad, Swapna V. [16]
Widera, Darius [17]
Zhang, Junpeng [18]
Meule, Adrian [19, 887]
Honjo, Ken [20]
Pourret, Olivier [21]
Yin, Cong-Cong [22]
Zhang, Zhongheng [23]
Cascella, Marco [24]
Flegel, Willy A. [25]
Goodyear, Carl S. [26]
van Raaij, Mark J. [10]
Bukowy-Bieryllo, Zuzanna [27]
Campana, Luca G. [28]
Kurniawan, Nicholas A. [29]
Lalaouna, David [30]
Huttner, Felix J. [31]
Ammerman, Brooke A. [32]
Ehret, Felix [33]
Cobine, Paul A. [34]
Tan, Ene-Choo [35]
Han, Hyemin [36]
Xia, Wenfeng [37]
McCrum, Christopher [38]
Dings, Ruud P. M. [39]
Marinello, Francesco [40]
Nilsson, Henrik [41]
Nixon, Brett [42]
Voskarides, Konstantinos [43]
Yang, Long [44]
Costa, Vincent D. [45]
Bengtsson-Palme, Johan [46]
Bradshaw, William [47]
Grimm, Dominik G. [48]
Kumar, Nitin [49]
Martis, Elvis [50]
Affiliations
[1] Griffith Univ, Sch Informat & Commun Technol, Gold Coast, Qld 4222, Australia
[2] Griffith Univ, Inst Glyc, Gold Coast, Qld 4222, Australia
[3] Univ Colorado, Dept Med Med Oncol, Anschutz Med Campus, Denver, CO USA
[4] Tanta Univ, Fac Sci, Bot Dept, Tanta, Egypt
[5] Friedrich Schiller Univ, Jena Univ Hosp, Inst Human Genet, Jena, Germany
[6] Univ Med Ctr Schleswig Holstein, Dept Radiat Oncol, Campus Kiel, Kiel, Germany
[7] ARS, USDA, Plum Isl Anim Dis Ctr, Greenport, NY 11944 USA
[8] Univ Jyvaskyla, Dept Biol & Environm Sci, Jyvaskyla, Finland
[9] Graz Univ Technol, Inst Environm Biotechnol, Graz, Austria
[10] CSIC, CNB, Natl Biotechnol Ctr, Dept Macromol Struct, Madrid, Spain
[11] Univ Hong Kong, Fac Dent, Oral & Maxillofacial Radiol Appl Oral Sci & Commu, Hong Kong, Peoples R China
[12] Univ Kansas, Div Entomol, Biodivers Inst, Lawrence, KS 66045 USA
[13] SUNY Albany, RNA Inst, Albany, NY 12222 USA
[14] Robert Koch Inst, Dept Methods Dev & Res Infrastruct, Berlin, Germany
[15] Martin Luther Univ Halle Wittenberg, Dept Surg & Conservat Pediat & Adolescent Med, Halle, Germany
[16] Indiana Univ Purdue Univ, IU Sch Informat & Comp, Dept BioHlth Informat, Indianapolis, IN 46202 USA
[17] Univ Reading, Sch Pharm Stem Cell Biol & Regenerat Med, Reading, Berks, England
[18] Dali Univ, Sch Engn, Dali City, Yunnan, Peoples R China
[19] Univ Hosp Munich LMU, Dept Psychiat & Psychotherapy, Munich, Germany
[20] Univ Tsukuba, Fac Life & Environm Sci, Ibaraki, Japan
[21] UniLaSalle, Aghyle, Beauvais, France
[22] Henry Ford Hlth Syst, Dept Immunol, Detroit, MI USA
[23] Zhejiang Univ, Sch Med, Sir Run Run Shaw Hosp, Dept Emergency, Hangzhou 310016, Zhejiang, Peoples R China
[24] Ist Nazl Tumori Fdn Pascale IRCCS, Anesthesia & Pain Med, Naples, Italy
[25] NIH, Dept Transfus Med, Bethesda, MD USA
[26] Univ Glasgow, Inst Infect Immun & Inflammat, Glasgow, Lanark, Scotland
[27] Polish Acad Sci, Inst Human Genet, Poznan, Poland
[28] Univ Padua, Dept Surg Oncol & Gastroenterol DISCOG, Padua, Italy
[29] Eindhoven Univ Technol, Biomed Engn, Eindhoven, Netherlands
[30] Univ Strasbourg, IBMC, Strasbourg, France
[31] Heidelberg Univ, Dept Gen Visceral & Transplantat Surg, Heidelberg, Germany
[32] Univ Notre Dame, Psychol, Notre Dame, IN 46556 USA
[33] Harvard Med Sch, Massachusetts Gen Hosp, Radiol & Pathol, Boston, MA 02115 USA
[34] Auburn Univ, Dept Biol Sci, Auburn, AL 36849 USA
[35] KK Womens & Childrens Hosp, KK Res Ctr, Singapore, Singapore
[36] Univ Alabama, Educ Psychol, Tuscaloosa, AL USA
[37] UCL, Wellcome EPSRC Ctr Intervent & Surg Sci, London, England
[38] Maastricht Univ, Dept Nutr & Movement Sci, Maastricht, Netherlands
[39] Univ Arkansas Med Sci, Dept Radiat Oncol, Little Rock, AR 72205 USA
[40] Univ Padua, Dept Land Environm Agr & Forestry, Padua, Italy
[41] Univ Gothenburg, Dept Biol & Environm Sci, Gothenburg, Sweden
[42] Univ Newcastle, Prior Res Ctr Reprod Sci, Callaghan, NSW, Australia
[43] Univ Cyprus, Med Sch, Nicosia, Cyprus
[44] Shandong Agr Univ, Coll Plant Protect, Agr Big Data Res Ctr, Tai An, Shandong, Peoples R China
[45] NIMH, Neuropsychol Lab, Bldg 9, Bethesda, MD 20892 USA
[46] Univ Wisconsin, Wisconsin Inst Discovery, Madison, WI USA
[47] Univ Oxford, Struct Genom Consortium, Oxford, England
[48] Weihenstephan Triesdorf Univ Appl Sci, TUM Campus Straubing Biotechnol & Sustainabil, Bioinformat, Straubing, Germany
[49] Univ Michigan, Cardiovasc Res, Ann Arbor, MI 48109 USA
[50] Bombay Coll Pharm, Pharmaceut Chem, Mumbai, Maharashtra, India
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2019
Keywords
RECOMMENDER-SYSTEMS
DOI
10.1093/database/baz085
Chinese Library Classification number
Q [Biological sciences]
Discipline classification code
07; 0710; 09
Abstract
Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
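
To make two of the baseline methods named in the abstract concrete, the following is a minimal sketch of ranking candidate titles against a seed title with Term Frequency-Inverse Document Frequency (cosine similarity) and Okapi Best Matching 25, then measuring how much the two top-k lists overlap. It is not the consortium's implementation: whitespace tokenization, the shared smoothed IDF formula, the parameter values `k1=1.2` and `b=0.75`, the toy titles and every function name are assumptions made for illustration. PubMed Related Articles is not reproduced here.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems stem and drop stop words.
    return text.lower().split()

def idf_table(docs):
    """Smoothed inverse document frequency (always positive) for each corpus term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

def tfidf_cosine(query, doc, idf):
    """Cosine similarity between TF-IDF vectors of two token lists."""
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    d = {t: c * idf.get(t, 0.0) for t, c in Counter(doc).items()}
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0

def bm25(query, doc, idf, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of doc for the query terms (k1 and b are assumed defaults)."""
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        f = tf.get(t, 0)
        if f:
            score += idf.get(t, 0.0) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
    return score

def rank(seed, candidates, scorer):
    """Candidate indices ordered from most to least similar to the seed."""
    return sorted(range(len(candidates)), key=lambda i: scorer(seed, candidates[i]), reverse=True)

def jaccard(a, b):
    """Overlap of two recommendation sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical seed and candidate titles, invented for this example.
seed = tokenize("protein structure prediction with deep learning")
candidates = [tokenize(t) for t in (
    "deep learning for protein structure prediction",
    "soil microbial communities in alpine grasslands",
    "learning protein contact maps",
)]
idf = idf_table([seed] + candidates)
avg_len = sum(len(c) for c in candidates) / len(candidates)

tfidf_rank = rank(seed, candidates, lambda q, d: tfidf_cosine(q, d, idf))
bm25_rank = rank(seed, candidates, lambda q, d: bm25(q, d, idf, avg_len))
top2_overlap = jaccard(tfidf_rank[:2], bm25_rank[:2])
```

On this toy corpus both scorers agree on the ordering. On a realistic corpus the top-k lists can diverge, and the Jaccard overlap is one simple way to quantify how distinct the recommended collections are.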
Pages: 1 / 67
Number of pages: 66