Automated gene-model curation using global discriminative learning

Cited by: 4
Authors
Bernal, Axel [1]
Crammer, Koby [2]
Pereira, Fernando [1, 3]
Affiliations
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel
[3] Google Inc, Mountain View, CA 94043 USA
Funding
US National Science Foundation (NSF);
Keywords
MULTIPLE SOURCES; INTEGRATION; FRAMEWORK;
DOI
10.1093/bioinformatics/bts176
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: Gene-model curation creates consensus gene models by combining multiple sources of protein-coding evidence that may be incomplete or inconsistent. To date, manual curation still produces the highest quality models. However, manual curation is too slow and costly to be completed even for the most important organisms. In recent years, machine-learned ensemble gene predictors have become a viable alternative to manual curation. Current approaches use the consistency of signals and genomic regions among sources, together with some voting scheme, to resolve conflicts in the evidence. As a further step in that direction, we have developed eCRAIG (ensemble CRAIG), an automated curation tool that combines multiple sources of evidence using global discriminative training. This allows efficient integration of different types of genomic evidence with complex statistical dependencies to directly maximize annotation accuracy. Our method goes beyond previous work in integrating novel nonlinear annotation agreement features, as well as combinations of intrinsic features of the target sequence and extrinsic annotation features.
Results: We achieved significant improvements over the best ensemble predictors available for Homo sapiens, Caenorhabditis elegans and Arabidopsis thaliana. In particular, eCRAIG achieved a relative mean improvement of 5.1% over Jigsaw, which was the best published ensemble predictor in all our experiments.
Pages: 1571-1578
Number of pages: 8