The Limits of De Novo DNA Motif Discovery

Cited by: 21
Authors
Simcha, David [1]
Price, Nathan D. [2]
Geman, Donald [3]
Affiliations
[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD 21218 USA
[2] Inst Syst Biol, Seattle, WA USA
[3] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD USA
Source
PLOS ONE | 2012 / Vol. 7 / Issue 11
Funding
National Institutes of Health (USA);
Keywords
FACTOR-BINDING SITES; GENE-EXPRESSION; HUMAN GENOME; SEQUENCES; PROFILES; ELEMENTS; TOOLS; MAP;
DOI
10.1371/journal.pone.0047836
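Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];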
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
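As a rough illustration of the first validation approach described in the abstract (testing a motif against a background null), the sketch below contrasts a shuffled-sequence null with a null drawn from real background sequences. It is not the authors' LR or ALR implementation: the function names, the use of exact-match counts as the motif score, and the Monte Carlo p-value are all assumptions made for illustration.

```python
import random


def count_hits(seq, motif):
    """Count (possibly overlapping) exact occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def empirical_p_value(observed, null_scores):
    """Monte Carlo p-value: share of null scores >= observed, with +1 correction."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def shuffled_null(seqs, motif, n_draws, rng):
    """Null scores from shuffling each cluster sequence (keeps base composition only)."""
    scores = []
    for _ in range(n_draws):
        total = 0
        for s in seqs:
            chars = list(s)
            rng.shuffle(chars)
            total += count_hits("".join(chars), motif)
        scores.append(total)
    return scores


def real_background_null(background, motif, cluster_size, n_draws, rng):
    """Null scores from random same-size sets of real (hypothetical) background sequences."""
    scores = []
    for _ in range(n_draws):
        sample = rng.sample(background, cluster_size)
        scores.append(sum(count_hits(s, motif) for s in sample))
    return scores
```

Under these assumptions, one would compute the observed hit count in a gene cluster and pass it to `empirical_p_value` once with each null. A motif that looks significant only against the shuffled null would be the kind of overly optimistic result the abstract attributes to background models that are "too null."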
Pages: 9
Related papers
50 in total
  • [31] THE DISCOVERY OF DE NOVO GENE EVOLUTION
    Tautz, Diethard
    PERSPECTIVES IN BIOLOGY AND MEDICINE, 2014, 57 (01) : 149 - 161
  • [32] De Novo Regulatory Motif Discovery Identifies Significant Motifs in Promoters of Five Classes of Plant Dehydrin Genes
    Zolotarov, Yevgen
    Stroemvik, Martina
    PLOS ONE, 2015, 10 (06):
  • [33] De novo Motif Prediction using the Fireworks Algorithm
    Lihu, Andrei
    Holban, Stefan
    INTERNATIONAL JOURNAL OF SWARM INTELLIGENCE RESEARCH, 2015, 6 (03) : 24 - 40
  • [34] SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing
    Wylie, Dennis C.
    Hofmann, Hans A.
    Zemelman, Boris V.
    BIOINFORMATICS, 2019, 35 (20) : 3944 - 3952
  • [35] SMARTIV: combined sequence and structure de-novo motif discovery for in-vivo RNA binding data
    Polishchuk, Maya
    Paz, Inbal
    Yakhini, Zohar
    Mandel-Gutfreund, Yael
    NUCLEIC ACIDS RESEARCH, 2018, 46 (W1) : W221 - W228
  • [36] Performance evaluation of DNA motif discovery programs
    Singh, Chandra Prakash
    Khan, Feroz
    Mishra, Bhartendu Nath
    Chauhan, Durg Singh
    BIOINFORMATION, 2008, 3 (05) : 205 - 212
  • [37] A visualization approach to Motif discovery in DNA sequences
    Rambally, Gerard
    PROCEEDINGS IEEE SOUTHEASTCON 2007, VOLS 1 AND 2, 2007, : 348 - 353
  • [38] Seeder: discriminative seeding DNA motif discovery
    Fauteux, Francois
    Blanchette, Mathieu
    Stromvik, Martina V.
    BIOINFORMATICS, 2008, 24 (20) : 2303 - 2307
  • [39] Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data
    Raditsa, Vladimir V.
    Tsukanov, Anton V.
    Bogomolov, Anton G.
    Levitsky, Victor G.
    NAR GENOMICS AND BIOINFORMATICS, 2024, 6 (03)
  • [40] PriSeT: Efficient De Novo Primer Discovery
    Hoffmann, Marie
    Monaghan, Michael T.
    Reinert, Knut
    12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021), 2021,