Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics

Cited by: 84

Authors:

Wan, Lin ^{[1]}

Reinert, Gesine ^{[2]}

Sun, Fengzhu ^{[1,3]}

Waterman, Michael S. ^{[1,3]}

Affiliations:

[1] Univ So Calif, Los Angeles, CA 90089 USA

[2] Univ Oxford, Dept Stat, Oxford OX1 3TG, England

[3] Tsinghua Univ, TNLIST Dept Automat, Beijing 100084, Peoples R China

Source:

JOURNAL OF COMPUTATIONAL BIOLOGY | 2010 / Vol. 17 / Issue 11

Funding:

Engineering and Physical Sciences Research Council (UK); Biotechnology and Biological Sciences Research Council (UK);

Keywords:

alignment-free; hidden Markov model; motifs; normal approximation; power; sequence alignment; word count statistics; K-WORD MATCHES;

DOI:

10.1089/cmb.2010.0056

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D_2, which counts the number of matching k-tuples between two sequences, as well as D_2^*, which uses centralized counts, and D_2^S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D_2^S has the largest power, followed by D_2^*, whereas the power of D_2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D_2^* generally has the largest power. Under the first alternative model of a shared motif, the power of D_2^* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D_2^* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D_2^* is generally more powerful than D_2.
The program to calculate the power of D_2, D_2^* and D_2^S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
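
The three statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program. The abstract only describes the statistics in words, so the formulas below are an assumption based on their commonly cited definitions: with X_w and Y_w the counts of word w in the two sequences, D_2 sums X_w Y_w; D_2^* sums the centered products divided by sqrt(n̄ m̄) p_w; and D_2^S sums the centered products divided by sqrt(X̃_w² + Ỹ_w²). The names `kmer_counts`, `d2_statistics` and `letter_probs` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(a, b, k, letter_probs):
    """Return (D2, D2*, D2S) for sequences a and b.

    letter_probs maps each letter to its probability under the i.i.d.
    null model; the formulas are assumed, not taken from this record.
    """
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    n_bar, m_bar = len(a) - k + 1, len(b) - k + 1  # number of word positions
    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[c] for c in w)  # word probability under the null
        xc = x[w] - n_bar * p_w                 # centered count in a
        yc = y[w] - m_bar * p_w                 # centered count in b
        d2 += x[w] * y[w]
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:                           # skip words with zero centered counts
            d2_s += xc * yc / denom
    return d2, d2_star, d2_s
```

The loop enumerates all words over the alphabet, which is practical only for small k; this sketch favors readability over speed.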

Citation

Pages: 1467 / +

Number of pages: 24

15 entries in total

[1] Approximate word matches between two random sequences [J].