K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Cited by: 7
Authors
Lin, Jie [1 ]
Adjeroh, Donald A. [2 ]
Jiang, Bing-Hua [3 ]
Jiang, Yue [1 ]
Affiliations
[1] Fujian Normal Univ, Dept Software Engn, Coll Math & Informat, Fuzhou 350108, Fujian, Peoples R China
[2] West Virginia Univ, Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[3] Univ Iowa, Dept Pathol, Carver Coll Med, Iowa City, IA 52242 USA
Funding
U.S. National Science Foundation;
Keywords
WORD FREQUENCIES; DISTANCE MEASURE; DNA-SEQUENCES; COMPRESSION; PHYLOGENY;
DOI
10.1093/bioinformatics/btx809
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between very large numbers of sequences much faster than alignment-based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared with other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which determines the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or increasing dataset size.
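As a rough illustration of the general idea behind rank-based alignment-free comparison, the sketch below counts overlapping k-mers in two DNA sequences and scores them with a plain Kendall tau-b correlation. This is not the K2 or K2* statistic from the paper, whose exact definition is not given in this record; the function names, the ACGT alphabet, and the default word length `k=3` are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers of seq over the DNA alphabet ACGT.

    Words containing other symbols (e.g. N) are ignored in this sketch.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering of all 4**k words so both sequences share the same axes.
    return [counts.get("".join(w), 0) for w in product("ACGT", repeat=k)]


def kendall_tau_b(x, y):
    """Plain O(n^2) Kendall tau-b with tie correction (illustrative only)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both vectors: excluded from every term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def kmer_kendall_similarity(a, b, k=3):
    """Hypothetical helper: tau-b between the k-mer count vectors of a and b."""
    return kendall_tau_b(kmer_counts(a, k), kmer_counts(b, k))
```

No alignment is computed at any point: each sequence is reduced to a fixed-length count vector, so the cost per pair depends on the number of possible words rather than on an alignment matrix. The quadratic pair loop is kept for readability; a practical implementation would use an O(n log n) tau algorithm.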
Pages: 1682-1689
Page count: 8
Related papers
45 in total
  • [1] Aach John., 2001, Nature, V26, P5
  • [2] Adjeroh D., 2008, The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
  • [3] [Anonymous], 1997, ACM SIGACT NEWS
  • [4] Robinson-Foulds Supertrees
    Bansal, Mukul S.
    Burleigh, J. Gordon
    Eulenstein, Oliver
    Fernandez-Baca, David
    [J]. ALGORITHMS FOR MOLECULAR BIOLOGY, 2010, 5
  • [5] A wavelet-based feature vector model for DNA clustering
    Bao, J. P.
    Yuan, R. Y.
    [J]. GENETICS AND MOLECULAR RESEARCH, 2015, 14 (04): : 19163 - 19172
  • [6] An improved alignment-free model for dna sequence similarity metric
    Bao, Junpeng
    Yuan, Ruiyu
    Bao, Zhe
    [J]. BMC BIOINFORMATICS, 2014, 15
  • [7] The average mutual information profile as a genomic signature
    Bauer, Mark
    Schuster, Sheldon M.
    Sayood, Khalid
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [8] Beal R, 2016, IEEE INT C BIOINFORM, P92, DOI 10.1109/BIBM.2016.7822498
  • [9] A new algorithm for "the LCS problem" with application in compressing genome resequencing data
    Beal, Richard
    Afrin, Tazin
    Farheen, Aliya
    Adjeroh, Donald
    [J]. BMC GENOMICS, 2016, 17