Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics

Cited by: 84
Authors
Wan, Lin [1]
Reinert, Gesine [2]
Sun, Fengzhu [1, 3]
Waterman, Michael S. [1, 3]
Affiliations
[1] Univ So Calif, Los Angeles, CA 90089 USA
[2] Univ Oxford, Dept Stat, Oxford OX1 3TG, England
[3] Tsinghua Univ, TNLIST Dept Automat, Beijing 100084, Peoples R China
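Funding
UK Engineering and Physical Sciences Research Council (EPSRC); UK Biotechnology and Biological Sciences Research Council (BBSRC)
Keywords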
alignment-free; hidden Markov model; motifs; normal approximation; power; sequence alignment; word count statistics; K-WORD MATCHES;
DOI
10.1089/cmb.2010.0056
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy to use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2*, and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
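The abstract describes the three statistics only in words. As a rough illustration, the sketch below computes them for two short sequences under an i.i.d. letter model. It is not the authors' program: the function names `kmer_counts` and `d2_statistics` are hypothetical, and the formulas are an assumption taken from the way these statistics are commonly written in the alignment-free literature (raw count products for D2, centralized counts scaled by the expected word frequency for D2*, and centralized counts scaled by their own magnitude for D2S).

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(a, b, k, probs):
    """Return (D2, D2*, D2S) for sequences a and b.

    probs maps each letter to its probability under the i.i.d. null model.
    The formulas are assumed forms, not taken from the abstract itself.
    """
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    n_bar = len(a) - k + 1  # number of word positions in a
    m_bar = len(b) - k + 1  # number of word positions in b
    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(probs), repeat=k):
        w = "".join(letters)
        p_w = prod(probs[c] for c in w)  # word probability under the null
        xw, yw = x[w], y[w]
        d2 += xw * yw  # product of raw counts
        xc = xw - n_bar * p_w  # centralized count in a
        yc = yw - m_bar * p_w  # centralized count in b
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:  # skip words whose centralized counts are both zero
            d2_s += xc * yc / denom
    return d2, d2_star, d2_s


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    print(d2_statistics("ACGTACGTTGCA", "TTACGTACGGCA", 3, uniform))
```

Looping over all words of length k keeps the sketch close to the written sums, but it costs time proportional to the alphabet size raised to the power k, so it is only practical for small k.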
Pages: 1467 / +
Number of pages: 24