Empirical distribution of k-word matches in biological sequences

Cited by: 19
Authors
Foret, Sylvain [1]
Wilson, Susan R. [1]
Burden, Conrad J. [1, 2]
Affiliations
[1] Australian Natl Univ, Inst Math Sci, Ctr Bioinformat Sci, Canberra, ACT 0200, Australia
[2] Australian Natl Univ, John Curtin Sch Med Res, Canberra, ACT 0200, Australia
Keywords
Alignment-free sequence comparison; Biological sequences; Genomic data;
DOI
10.1016/j.patcog.2008.06.026
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D2 statistic. This statistic has two advantages over alignment-based methods: first, it does not assume that homologous segments are contiguous; second, the algorithm is computationally extremely fast, with a runtime proportional to the size of the sequence under scrutiny. Existing applications of the D2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D2 for ranges of parameters most frequently encountered in the study of biological sequences. © 2008 Elsevier Ltd. All rights reserved.
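As an illustration of the statistic described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It counts k-word matches between two sequences by summing, over every word, the product of its occurrence counts in each sequence; this equals the number of position pairs (i, j) at which the two sequences carry the same k-word. The function names `kword_counts`, `d2`, and `simulate_d2` are hypothetical, and the simulation assumes independent, uniformly distributed letters, which is an assumption made here for illustration only.

```python
import random
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Equals the number of position pairs (i, j) such that
    seq_a[i:i+k] == seq_b[j:j+k].
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # Iterate over the smaller table; a missing word counts as 0.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[word] for word, n in counts_a.items())


def simulate_d2(length, k, n_samples, alphabet="ACGT", seed=0):
    """Empirical sample of D2 for pairs of random sequences.

    Assumption (illustrative only): letters are independent and
    uniformly distributed over the alphabet.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        a = "".join(rng.choice(alphabet) for _ in range(length))
        b = "".join(rng.choice(alphabet) for _ in range(length))
        samples.append(d2(a, b, k))
    return samples
```

Building each count table takes one pass over its sequence, which is consistent with the linear runtime mentioned in the abstract. A call such as `simulate_d2(200, 5, 1000)` returns a list of D2 values whose histogram gives a rough empirical picture of the distribution under the stated uniform-letter assumption.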
Pages: 539-548
Number of pages: 10