New powerful statistics for alignment-free sequence comparison under a pattern transfer model

Cited by: 33
Authors
Liu, Xuemei [1 ,2 ]
Wan, Lin [1 ]
Li, Jing [1 ]
Reinert, Gesine [3 ]
Waterman, Michael S. [1 ,4 ]
Sun, Fengzhu [1 ,4 ]
Affiliations
[1] Univ So Calif, Mol & Computat Biol Program, Los Angeles, CA 90089 USA
[2] S China Univ Technol, Sch Phys, Guangzhou, Guangdong, Peoples R China
[3] Univ Oxford, Dept Stat, Oxford OX1 3TG, England
[4] Tsinghua Univ, TNLIST Dept Automat, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Alignment-free sequence comparison; D2; Pattern transfer model; K-WORD MATCHES; PHYLOGENETIC TREE RECONSTRUCTION; FEATURE FREQUENCY PROFILES; COMPOSITION VECTOR METHOD; WHOLE-PROTEOME PHYLOGENY; REGULATORY SEQUENCES; ASYMPTOTIC-BEHAVIOR; DNA; GENOMES; DISSIMILARITY;
DOI
10.1016/j.jtbi.2011.06.020
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic, D2, and its variants D2* and D2S showed that, under a pattern transfer model, their power approaches a limit smaller than 1 as the sequence length tends to infinity. We develop new alignment-free statistics based on D2, D2* and D2S by comparing local sequence pairs and then summing over all local sequence pairs of a certain length. We show that the new statistics are much more powerful than the corresponding original statistics and that their power tends to 1 as the sequence length tends to infinity under the pattern transfer model. © 2011 Elsevier Ltd. All rights reserved.
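To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard definition of D2 as the sum, over all k-words, of the product of that word's counts in the two sequences. The function `local_d2_sum` is only one plausible reading of "summing over all the local sequence pairs of certain length": it uses overlapping windows of a fixed length and applies no centering or normalization, all of which are assumptions. The names `kmer_counts`, `d2`, `windows`, and `local_d2_sum` are hypothetical, and the D2* and D2S variants are not covered.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Standard D2: sum over k-words of count in seq_a times count in seq_b."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def windows(seq, length):
    """All overlapping substrings of the given length (an assumed windowing)."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def local_d2_sum(seq_a, seq_b, k, window):
    """Assumed illustration: sum D2 over every pair of local windows."""
    return sum(
        d2(wa, wb, k)
        for wa in windows(seq_a, window)
        for wb in windows(seq_b, window)
    )
```

For example, `d2("ACGT", "ACGT", 2)` counts the three shared 2-words once each, and `local_d2_sum("ACGT", "ACGT", 2, 3)` adds the D2 values of the four window pairs formed from `ACG` and `CGT`. The nested loop is quadratic in the number of windows, so this sketch is suited to short illustrative sequences only.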
Pages: 106-116
Page count: 11
Related papers
35 in total
  • [2] Approximate word matches between two random sequences
    Burden, Conrad J.
    Kantorovitz, Miriam R.
    Wilson, Susan R.
    [J]. ANNALS OF APPLIED PROBABILITY, 2008, 18 (01) : 1 - 21
  • [3] Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'
    Dai, Qi
    Wang, Tianming
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [4] Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures
    Dalevi, D
    Dubhashi, D
    Hermansson, M
    [J]. BIOINFORMATICS, 2006, 22 (05) : 517 - 522
  • [5] Detection and characterization of horizontal transfers in prokaryotes using genomic signature
    Dufraigne, C
    Fertil, B
    Lespinats, S
    Giron, A
    Deschavanne, P
    [J]. NUCLEIC ACIDS RESEARCH, 2005, 33 (01) : e6
  • [6] Horizontal transfer of tumor DNA to endothelial cells in vivo
    Ehnfors, J.
    Kost-Alimova, M.
    Persson, N. Luna
    Bergsmedh, A.
    Castro, J.
    Levchenko-Tegnebratt, T.
    Yang, L.
    Panaretakis, T.
    Holmgren, L.
    [J]. CELL DEATH AND DIFFERENTIATION, 2009, 16 (05) : 749 - 757
  • [7] Characterizing the D2 Statistic: Word Matches in Biological Sequences
    Foret, Sylvain
    Wilson, Susan R.
    Burden, Conrad J.
    [J]. STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2009, 8 (01)
  • [8] Empirical distribution of k-word matches in biological sequences
    Foret, Sylvain
    Wilson, Susan R.
    Burden, Conrad J.
    [J]. PATTERN RECOGNITION, 2009, 42 (04) : 539 - 548
  • [9] Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences
    Foret, Sylvain
    Kantorovitz, Miriam R.
    Burden, Conrad J.
    [J]. BMC BIOINFORMATICS, 2006, 7 (Suppl 5) : S21
  • [10] Whole genome molecular phylogeny of large dsDNA viruses using composition vector method
    Gao, Lei
    Qi, Ji
    [J]. BMC EVOLUTIONARY BIOLOGY, 2007, 7 (1)