The Limits of De Novo DNA Motif Discovery

Cited by: 21
Authors
Simcha, David [1]
Price, Nathan D. [2]
Geman, Donald [3]
Affiliations
[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD 21218 USA
[2] Inst Syst Biol, Seattle, WA USA
[3] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD USA
Source
PLOS ONE | 2012 / Vol. 7 / Issue 11
Funding
National Institutes of Health (NIH), USA;
Keywords
FACTOR-BINDING SITES; GENE-EXPRESSION; HUMAN GENOME; SEQUENCES; PROFILES; ELEMENTS; TOOLS; MAP;
DOI
10.1371/journal.pone.0047836
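Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code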
07 ; 0710 ; 09 ;
Abstract
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
Pages: 9
Related papers
50 in total
  • [21] BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin
    Meghana Kshirsagar
    Han Yuan
    Juan Lavista Ferres
    Christina Leslie
    Genome Biology, 23
  • [22] A review of ensemble methods for de novo motif discovery in ChIP-Seq data
    Lihu, Andrei
    Holban, Stefan
    BRIEFINGS IN BIOINFORMATICS, 2015, 16 (06) : 964 - 973
  • [23] DISPOM: A DISCRIMINATIVE DE-NOVO MOTIF DISCOVERY TOOL BASED ON THE JSTACS LIBRARY
    Grau, Jan
    Keilwagen, Jens
    Gohr, Andre
    Paponov, Ivan A.
    Posch, Stefan
    Seifert, Michael
    Strickert, Marc
    Grosse, Ivo
    JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2013, 11 (01)
  • [24] A general approach for discriminative de novo motif discovery from high-throughput data
    Grau, Jan
    Posch, Stefan
    Grosse, Ivo
    Keilwagen, Jens
    NUCLEIC ACIDS RESEARCH, 2013, 41 (21)
  • [25] AptCompare: optimized de novo motif discovery of RNA aptamers via HTS-SELEX
    Shieh, Kevin R.
    Kratschmer, Christina
    Maier, Keith E.
    Greally, John M.
    Levy, Matthew
    Golden, Aaron
    BIOINFORMATICS, 2020, 36 (09) : 2905 - 2906
  • [26] De novo motif discovery facilitates identification of interactions between transcription factors in Saccharomyces cerevisiae
    Chen, Mei-Ju May
    Chou, Lih-Ching
    Hsieh, Tsung-Ting
    Lee, Ding-Dar
    Liu, Kai-Wei
    Yu, Chi-Yuan
    Oyang, Yen-Jen
    Tsai, Huai-Kuang
    Chen, Chien-Yu
    BIOINFORMATICS, 2012, 28 (05) : 701 - 708
  • [27] Informative priors based on transcription factor structural class improve de novo motif discovery
    Narlikar, Leelavati
    Gordan, Raluca
    Ohler, Uwe
    Hartemink, Alexander J.
    BIOINFORMATICS, 2006, 22 (14) : E384 - E392
  • [28] TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets
    Dang, Louis T.
    Tondl, Markus
    Chiu, Man Ho H.
    Revote, Jerico
    Paten, Benedict
    Tano, Vincent
    Tokolyi, Alex
    Besse, Florence
    Quaife-Ryan, Greg
    Cumming, Helen
    Drvodelic, Mark J.
    Eichenlaub, Michael P.
    Hallab, Jeannette C.
    Stolper, Julian S.
    Rossello, Fernando J.
    Bogoyevitch, Marie A.
    Jans, David A.
    Nim, Hieu T.
    Porrello, Enzo R.
    Hudson, James E.
    Ramialison, Mirana
    BMC GENOMICS, 2018, 19
  • [29] TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets
    Louis T. Dang
    Markus Tondl
    Man Ho H. Chiu
    Jerico Revote
    Benedict Paten
    Vincent Tano
    Alex Tokolyi
    Florence Besse
    Greg Quaife-Ryan
    Helen Cumming
    Mark J. Drvodelic
    Michael P. Eichenlaub
    Jeannette C. Hallab
    Julian S. Stolper
    Fernando J. Rossello
    Marie A. Bogoyevitch
    David A. Jans
    Hieu T. Nim
    Enzo R. Porrello
    James E. Hudson
    Mirana Ramialison
    BMC Genomics, 19
  • [30] De novo discovery of bicycles
    Kong, Xu-Dong
    Tian, Changlin
    NATURE CHEMICAL BIOLOGY, 2025, 21 (01) : 29 - 31