Realistic artificial DNA sequences as negative controls for computational genomics

Cited by: 22
Authors
Caballero, Juan [1]
Smit, Arian F. A. [1]
Hood, Leroy [1]
Glusman, Gustavo [1]
Affiliations
[1] Inst Syst Biol, Seattle, WA 98109 USA
Funding
National Institutes of Health (USA)
Keywords
GENE PREDICTION; MAMMALIAN EVOLUTION; INITIAL SEQUENCE; ENCODE PROJECT; DRAFT GENOME; INSIGHTS; ALIGNMENT; ELEMENTS; DATABASE; TRANSCRIPTION;
DOI
10.1093/nar/gku356
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
A common practice in computational genomic analysis is to use a set of 'background' sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such 'background' sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by 'shuffling' real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
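The contrast the abstract draws between shuffled sequences and model-based artificial sequences can be illustrated with a minimal Python sketch. This is not the authors' method: it only compares a plain mononucleotide shuffle, which keeps base composition and nothing else, with a simple k-mer Markov chain trained on a real sequence, which also keeps local k-mer statistics. It does not model interspersed repeats or regional composition. The function names (`shuffle_sequence`, `train_markov`, `generate_markov`) and the uniform fallback for unseen contexts are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def shuffle_sequence(seq, rng):
    """Mononucleotide shuffle: preserves base composition only."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def train_markov(seq, k):
    """Count which base follows each k-mer in the training sequence (k >= 1)."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def generate_markov(counts, k, length, rng, alphabet="ACGT"):
    """Emit a sequence whose k-mer transitions follow the trained counts.

    Assumes `counts` is non-empty, i.e. the training sequence was longer than k.
    """
    out = list(rng.choice(sorted(counts)))  # seed with an observed k-mer
    while len(out) < length:
        context = "".join(out[-k:])
        followers = counts.get(context)
        if followers:
            bases, weights = zip(*sorted(followers.items()))
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            # Hypothetical fallback for contexts never seen in training.
            out.append(rng.choice(alphabet))
    return "".join(out[:length])
```

A typical use would be to train on a presumed intergenic region with a moderate `k`, generate many artificial sequences, and run a prediction tool on them; any hit is then a false positive by construction.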
Pages: 12