Probabilistic and statistical properties of words: An overview

Cited by: 170
Authors
Reinert, G
Schbath, S
Waterman, MS
Affiliations
[1] Kings Coll London, Cambridge CB2 1ST, England
[2] Stat Lab, Cambridge CB2 1ST, England
[3] INRA, Biometr Unit, F-78352 Jouy En Josas, France
[4] Univ So Calif, Dept Math, Los Angeles, CA 90089 USA
[5] Univ So Calif, Dept Biol Sci, Los Angeles, CA 90089 USA
[6] Univ So Calif, Dept Comp Sci, Los Angeles, CA 90089 USA
Keywords
word counts; renewal counts; Markov model; exact distribution; normal approximation; Poisson process approximation; compound Poisson approximation; occurrences of multiple words; sequencing by hybridization; martingales; moment generating functions; Stein's method; Chen-Stein method;
DOI
10.1089/10665270050081360
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
Pages: 1-46
Page count: 46