Asymptotic behavior of k-word matches between two uniformly distributed sequences

Cited by: 15
Authors
Kantorovitz, M. R.
Booth, H. S.
Burden, C. J.
Wilson, S. R.
Affiliations
[1] Univ Illinois, Dept Math, Urbana, IL 61801 USA
[2] Australian Natl Univ, Inst Math Sci, Canberra, ACT 0200, Australia
Keywords
Stein's method; count vector; k-word matches; sequence comparison
DOI
10.1239/jap/1189717545
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Given two sequences of length n over a finite alphabet A of size |A| = d, the D_2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D_2 when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the D_2 statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D_2 statistic is approximately normal as n gets large. We also give a formula for the variance of D_2 in the uniform case.
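As an illustration of the definition in the abstract, the following is a minimal sketch of how D_2 could be computed, assuming that word matches are counted over all pairs of overlapping length-k windows, one window from each sequence. The function names `kword_counts`, `d2` and `d2_mean_uniform` are hypothetical and are not taken from the paper. The mean shown follows from linearity of expectation under i.i.d. uniform letters; the paper's variance formula is not reproduced here.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every length-k word in seq, using overlapping windows."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of position pairs (i, j) with seq_a[i:i+k] == seq_b[j:j+k]."""
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # A word seen c_a times in seq_a and c_b times in seq_b gives c_a * c_b matches.
    return sum(c_a * counts_b[word] for word, c_a in counts_a.items())


def d2_mean_uniform(n, k, d):
    """E[D_2] for two length-n sequences of i.i.d. uniform letters.

    There are (n - k + 1)**2 position pairs, and each pair of k-words
    matches with probability d**(-k).
    """
    return (n - k + 1) ** 2 / d ** k


if __name__ == "__main__":
    print(d2("ACGT", "ACGT", 2))
    print(d2_mean_uniform(4, 2, 4))
```

Grouping by word first avoids comparing every window pair directly, so the count takes time roughly linear in the sequence lengths rather than quadratic.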
Pages: 788-805
Number of pages: 18
Related papers
16 in total
[1] [Anonymous], IMA VOLUMES MATH ITS
[2] Barbour AD, 2001, ANN APPL PROBAB, V11, P964
[3] Billingsley P., 1995, PROBABILITY MEASURE
[4] d2_cluster: A validated method for clustering EST and full-length cDNA sequences [J]. Burke, J; Davison, D; Hide, W. GENOME RESEARCH, 1999, 9 (11): 1135-1142
[5] Assessment of the parallelization approach of d2_cluster for high-performance sequence clustering [J]. Carpenter, JE; Christoffels, A; Weinbach, Y; Hide, WA. JOURNAL OF COMPUTATIONAL CHEMISTRY, 2002, 23 (07): 755-757
[6] POISSON APPROXIMATION FOR DEPENDENT TRIALS [J]. CHEN, LHY. ANNALS OF PROBABILITY, 1975, 3 (03): 534-545
[7] STACK: Sequence Tag Alignment and Consensus Knowledgebase [J]. Christoffels, A; van Gelder, A; Greyling, G; Miller, R; Hide, T; Hide, W. NUCLEIC ACIDS RESEARCH, 2001, 29 (01): 234-238
[8] Johnson NL, 1970, CONTINUOUS UNIVARIAT
[9] Distributional regimes for the number of k-word matches between two random sequences [J]. Lippert, RA; Huang, HY; Waterman, MS. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (22): 13980-13989
[10] A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base [J]. Miller, RT; Christoffels, AG; Gopalakrishnan, C; Burke, J; Ptitsyn, AA; Broveak, TR; Hide, WA. GENOME RESEARCH, 1999, 9 (11): 1143-1155