New powerful statistics for alignment-free sequence comparison under a pattern transfer model

Cited by: 33
Authors
Liu, Xuemei [1 ,2 ]
Wan, Lin [1 ]
Li, Jing [1 ]
Reinert, Gesine [3 ]
Waterman, Michael S. [1 ,4 ]
Sun, Fengzhu [1 ,4 ]
Affiliations
[1] Univ So Calif, Mol & Computat Biol Program, Los Angeles, CA 90089 USA
[2] S China Univ Technol, Sch Phys, Guangzhou, Guangdong, Peoples R China
[3] Univ Oxford, Dept Stat, Oxford OX1 3TG, England
[4] Tsinghua Univ, TNLIST Dept Automat, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Alignment-free sequence comparison; D2; Pattern transfer model; K-WORD MATCHES; PHYLOGENETIC TREE RECONSTRUCTION; FEATURE FREQUENCY PROFILES; COMPOSITION VECTOR METHOD; WHOLE-PROTEOME PHYLOGENY; REGULATORY SEQUENCES; ASYMPTOTIC-BEHAVIOR; DNA; GENOMES; DISSIMILARITY
DOI
10.1016/j.jtbi.2011.06.020
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic, D2, and its variants D2* and D2S showed that their power approaches a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D2* and D2S by comparing local sequence pairs and then summing over all the local sequence pairs of a certain length. We show that the new statistics are much more powerful than the corresponding original statistics and that their power tends to 1 as the sequence length tends to infinity under the pattern transfer model. (c) 2011 Elsevier Ltd. All rights reserved.
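
The idea of a word-count statistic and its local, summed counterpart can be illustrated with a minimal Python sketch. It uses the standard definition of D2 as the inner product of the k-word count vectors of two sequences. The function names (`kmer_counts`, `d2`, `local_d2_sum`), the use of non-overlapping windows, and the unnormalized sum are assumptions made for illustration only. This record does not give the paper's exact definitions, and the D2* and D2S variants are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """D2: inner product of the k-word count vectors of a and b."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())


def local_d2_sum(a, b, k, window):
    """Sum D2 over all pairs of local windows (illustrative assumption).

    Each sequence is cut into non-overlapping windows of the given length;
    a trailing fragment shorter than the window is ignored. The paper's
    actual windowing and normalization may differ.
    """
    windows_a = [a[i:i + window] for i in range(0, len(a) - window + 1, window)]
    windows_b = [b[j:j + window] for j in range(0, len(b) - window + 1, window)]
    return sum(d2(wa, wb, k) for wa in windows_a for wb in windows_b)


if __name__ == "__main__":
    print(d2("ACGT", "ACGT", 2))                      # 3 shared 2-words
    print(local_d2_sum("ACGTACGT", "ACGTACGT", 2, 4))  # 4 window pairs x 3
```

For plain D2, summing over all window pairs differs from the global statistic only by the words that straddle window boundaries. A local construction changes the statistic more substantially when each local value is normalized, as in the D2* and D2S variants named in the abstract.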
Pages: 106-116
Number of pages: 11