The Limits of De Novo DNA Motif Discovery

Cited by: 21
Authors
Simcha, David [1]
Price, Nathan D. [2]
Geman, Donald [3]
Affiliations
[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD 21218 USA
[2] Inst Syst Biol, Seattle, WA USA
[3] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD USA
Source
PLOS ONE | 2012 / Vol. 7 / Issue 11
Funding
National Institutes of Health (USA);
Keywords
FACTOR-BINDING SITES; GENE-EXPRESSION; HUMAN GENOME; SEQUENCES; PROFILES; ELEMENTS; TOOLS; MAP;
DOI
10.1371/journal.pone.0047836
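Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification code
07; 0710; 09;
Abstract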
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
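The held-out evaluation criterion described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's LR or ALR algorithm: the exhaustive k-mer search, the function names `accuracy` and `best_kmer`, and all sequences below are hypothetical, invented only to show how a motif chosen on a small training cluster can look perfect in training yet perform at chance on held-out sequences.

```python
from itertools import product


def accuracy(kmer, pos, neg):
    """Fraction of sequences classified correctly by the rule
    'sequence contains kmer' => predicted cluster member."""
    hits = sum(kmer in s for s in pos)            # true positives
    rejections = sum(kmer not in s for s in neg)  # true negatives
    return (hits + rejections) / (len(pos) + len(neg))


def best_kmer(pos, neg, k):
    """Exhaustively pick the k-mer with the highest TRAINING accuracy.
    Ties are broken by alphabetical order (first one found wins)."""
    best, best_acc = None, -1.0
    for kmer in map("".join, product("ACGT", repeat=k)):
        acc = accuracy(kmer, pos, neg)
        if acc > best_acc:
            best, best_acc = kmer, acc
    return best, best_acc


# Toy data (assumption: invented for illustration, not from the paper).
train_pos = ["ACGTAC", "TTACGA", "GACGTT"]  # "co-regulated" cluster
train_neg = ["TTTTGA", "CCCTAA", "GGATCC"]  # background sequences
test_pos = ["ACGGGA", "TTTCCA", "GGGTTT"]   # held-out cluster members
test_neg = ["AACGTT", "CCCCCC", "TATATA"]   # held-out background

# Discover the motif on training data only, then score it on held-out data.
motif, train_acc = best_kmer(train_pos, train_neg, k=3)
heldout_acc = accuracy(motif, test_pos, test_neg)

print(f"motif={motif} train={train_acc:.2f} held-out={heldout_acc:.2f}")
```

With only three sequences per class, some 3-mer separates the training sets perfectly, which is the kind of over-fitting the abstract attributes to the small size of co-regulated gene clusters; only the held-out score gives an unbiased estimate of predictive value.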
Pages: 9