The gene normalization task in BioCreative III

Cited by: 74
Authors
Lu, Zhiyong [1]
Kao, Hung-Yu [2]
Wei, Chih-Hsuan [2]
Huang, Minlie [3]
Liu, Jingchen [3]
Kuo, Cheng-Ju [4]
Hsu, Chun-Nan [4, 5]
Tsai, Richard Tzong-Han [6]
Dai, Hong-Jie [4, 7]
Okazaki, Naoaki [8]
Cho, Han-Cheol [9]
Gerner, Martin [10]
Solt, Illes [11]
Agarwal, Shashank [12]
Liu, Feifan [12]
Vishnyakova, Dina [13]
Ruch, Patrick [14]
Romacker, Martin [15]
Rinaldi, Fabio [16]
Bhattacharya, Sanmitra [17]
Srinivasan, Padmini [17]
Liu, Hongfang [18]
Torii, Manabu [19]
Matos, Sergio [20]
Campos, David [20]
Verspoor, Karin [21]
Livingston, Kevin M. [21]
Wilbur, W. John [1]
Affiliations
[1] Natl Lib Med, Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
[2] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[4] Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan
[5] Univ So Calif, Inst Informat Sci, Marina Del Rey, CA 90292 USA
[6] Yuan Ze Univ, Dept Comp Sci & Engn, Chungli, Taiwan
[7] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu 30043, Taiwan
[8] Univ Tokyo, Interfac Initiat Informat Studies, Tokyo 1138654, Japan
[9] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138654, Japan
[10] Univ Manchester, Fac Life Sci, Manchester M13 9PT, Lancs, England
[11] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, H-1117 Budapest, Hungary
[12] Univ Wisconsin, Milwaukee, WI 53201 USA
[13] Univ Geneva, Div Med Informat Sci, BiTem Grp, CH-1211 Geneva 4, Switzerland
[14] Univ Appl Sci, Dept Informat Sci, BiTeM Grp, Geneva, Switzerland
[15] Novartis AG, NITAS TMS, Text Min Serv, Basel, Switzerland
[16] Univ Zurich, Inst Computat Linguist, Zurich, Switzerland
[17] Univ Iowa, Dept Comp Sci, Iowa City, IA 52242 USA
[18] Mayo Clin, Coll Med, Dept Hlth Sci Res, Rochester, MN 55905 USA
[19] Georgetown Univ, Med Ctr, Lab Text Intelligence Biomed, Washington, DC 20057 USA
[20] Univ Aveiro, DETI IEETA, P-3810193 Aveiro, Portugal
[21] Univ Colorado, Sch Med, Ctr Computat Pharmacol, Aurora, CO USA
Funding
Swiss National Science Foundation; US National Science Foundation;
Keywords
BIOINFORMATICS; ONTOGENE; PROTEIN; MODELS; NAMES; II.5; TOOL; IB;
DOI
10.1186/1471-2105-12-S8-S2
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
Results: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
Conclusions: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
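The relative improvements quoted in the abstract can be re-derived from the reported TAP-k scores. The following is a minimal Python sketch of that arithmetic only; the function and variable names are illustrative assumptions and do not come from the paper.

```python
# TAP-k scores on the 50-article gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
best_composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Percentage gain of the composite system over the best single team, per k,
# rounded to one decimal place as in the abstract.
improvements = {
    k: round(relative_improvement(best_team[k], best_composite[k]), 1)
    for k in best_team
}

for k, gain in improvements.items():
    print(f"k={k}: {gain}%")
```

The computed gains match the 12.4%, 21.8%, and 26.6% figures stated in the abstract.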
Pages: 19
Related papers
45 items in total
[1]  
[Anonymous], 2009, Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics, DOI 10.3115/1609067.1609070
[2]  
[Anonymous], 1998, P 7 INT C WORLD WID
[3]  
[Anonymous], 2008, P 14 ACM SIGKDD INT
[4]  
[Anonymous], 1998, LEARNING TEXT CATEGO
[5]  
[Anonymous], 2008, P 2008 C EMPIRICAL M
[6]   Gene Ontology: tool for the unification of biology [J].
Ashburner, M ;
Ball, CA ;
Blake, JA ;
Botstein, D ;
Butler, H ;
Cherry, JM ;
Davis, AP ;
Dolinski, K ;
Dwight, SS ;
Eppig, JT ;
Harris, MA ;
Hill, DP ;
Issel-Tarver, L ;
Kasarskis, A ;
Lewis, S ;
Matese, JC ;
Richardson, JE ;
Ringwald, M ;
Rubin, GM ;
Sherlock, G .
NATURE GENETICS, 2000, 25 (01) :25-29
[7]  
Bhattacharya S., 2010, Proceedings of the BioCreative III workshop, P55
[8]   Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics [J].
Carroll, Hyrum D. ;
Kann, Maricel G. ;
Sheetlin, Sergey L. ;
Spouge, John L. .
BIOINFORMATICS, 2010, 26 (14) :1708-1713
[9]   Data preparation and interannotator agreement: BioCreAtIvE task IB [J].
Colosimo, ME ;
Morgan, AA ;
Yeh, AS ;
Colombe, JB ;
Hirschman, L .
BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[10]   Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles [J].
Dai, Hong-Jie ;
Lai, Po-Ting ;
Tsai, Richard Tzong-Han .
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2010, 7 (03) :412-420