GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery

被引:74
作者
Li, Leping [1 ]
机构
[1] Natl Inst Environm Hlth Sci, Biostat Branch, NIH, Res Triangle Pk, NC 27709 USA
关键词
ChIP; de novo motif discovery; expectation-maximization; genetic algorithm; k-mer; spaced dyad; FACTOR-BINDING-SITES; GENOME-WIDE ANALYSIS; REGULATORY ELEMENTS; CHROMATIN-IMMUNOPRECIPITATION; NONCODING SEQUENCES; DNA; PATTERNS; IDENTIFICATION;
D O I
10.1089/cmb.2008.16TT
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization ( EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM is its computational efficiency on large ChIP datasets compared to competitors. We applied GADEM to six genome-wide ChIP datasets. Approximately, 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif ( e. g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets containing from 0.5 to > 13 million nucleotides with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at www.niehs.nih.gov/research/resources/software/GADEM/.
引用
收藏
页码:317 / 329
页数:13
相关论文
共 51 条
  • [1] Bailey T. L., 1994, Proc. Int. Conf. Intell. Syst. Mol. Biol., V2, P28
  • [2] Methods and statistics for combining motif match scores
    Bailey, TL
    Gribskov, M
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (02) : 211 - 221
  • [3] A bivalent chromatin structure marks key developmental genes in embryonic stem cells
    Bernstein, BE
    Mikkelsen, TS
    Xie, XH
    Kamal, M
    Huebert, DJ
    Cuff, J
    Fry, B
    Meissner, A
    Wernig, M
    Plath, K
    Jaenisch, R
    Wagschal, A
    Feil, R
    Schreiber, SL
    Lander, ES
    [J]. CELL, 2006, 125 (02) : 315 - 326
  • [4] Core transcriptional regulatory circuitry in human embryonic stem cells
    Boyer, LA
    Lee, TI
    Cole, MF
    Johnstone, SE
    Levine, SS
    Zucker, JR
    Guenther, MG
    Kumar, RM
    Murray, HL
    Jenner, RG
    Gifford, DK
    Melton, DA
    Jaenisch, R
    Young, RA
    [J]. CELL, 2005, 122 (06) : 947 - 956
  • [5] Finding motifs using random projections
    Buhler, J
    Tompa, M
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2002, 9 (02) : 225 - 242
  • [6] Genome-wide analysis of estrogen receptor binding sites
    Carroll, Jason S.
    Meyer, Clifford A.
    Song, Jun
    Li, Wei
    Geistlinger, Timothy R.
    Eeckhoute, Jerome
    Brodsky, Alexander S.
    Keeton, Erika Krasnickas
    Fertuck, Kirsten C.
    Hall, Giles F.
    Wang, Qianben
    Bekiranov, Stefan
    Sementchenko, Victor
    Fox, Edward A.
    Silver, Pamela A.
    Gingeras, Thomas R.
    Liu, X. Shirley
    Brown, Myles
    [J]. NATURE GENETICS, 2006, 38 (11) : 1289 - 1297
  • [7] NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence
    Down, TA
    Hubbard, TJP
    [J]. NUCLEIC ACIDS RESEARCH, 2005, 33 (05) : 1445 - 1453
  • [8] Discovering motifs in ranked lists of DNA sequences
    Eden, Eran
    Lipson, Doron
    Yogev, Sivan
    Yakhini, Zohar
    [J]. PLOS COMPUTATIONAL BIOLOGY, 2007, 3 (03) : 508 - 522
  • [9] A universal framework for regulatory element discovery across all Genomes and data types
    Elemento, Olivier
    Slonim, Noam
    Tavazoie, Saeed
    [J]. MOLECULAR CELL, 2007, 28 (02) : 337 - 350
  • [10] ESKIN E, 2004, RECOMB 04, P115