K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Cited by: 7

Authors:

Lin, Jie [1]

Adjeroh, Donald A. [2]

Jiang, Bing-Hua [3]

Jiang, Yue [1]

Affiliations:

[1] Fujian Normal Univ, Dept Software Engn, Coll Math & Informat, Fuzhou 350108, Fujian, Peoples R China

[2] West Virginia Univ, Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA

[3] Univ Iowa, Dept Pathol, Carver Coll Med, Iowa City, IA 52242 USA

Source:

BIOINFORMATICS | 2018, Vol. 34, No. 10

Funding:

US National Science Foundation;

Keywords:

WORD FREQUENCIES; DISTANCE MEASURE; DNA-SEQUENCES; COMPRESSION; PHYLOGENY;

DOI:

10.1093/bioinformatics/btx809

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared with other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which determines the appropriate algorithmic parameter (length) automatically, without first trying different values. Comparative analysis with state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or increasing dataset size.
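
The abstract does not spell out how the Kendall statistics are applied, so the following is only a minimal sketch of the general idea of alignment-free comparison with a rank correlation, not the paper's method. It assumes that each sequence is summarized by its k-mer (word) counts and that two sequences are compared with the plain Kendall tau-a coefficient over those count vectors. The function names (`kmer_counts`, `kendall_tau_a`, `kendall_similarity`), the DNA alphabet, and the default `k=2` are illustrative assumptions.

```python
from itertools import product


def kmer_counts(seq, k, alphabet="ACGT"):
    """Count every possible k-mer of `alphabet`, in lexicographic order."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words with symbols outside the alphabet
            counts[word] += 1
    return list(counts.values())


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    This is the simple O(n^2) formulation, written for clarity.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                score += 1
            elif s < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


def kendall_similarity(seq_a, seq_b, k=2):
    """Assumed similarity score: tau-a between the two k-mer count vectors."""
    return kendall_tau_a(kmer_counts(seq_a, k), kmer_counts(seq_b, k))


if __name__ == "__main__":
    print(kendall_similarity("ACGTACGTAC", "ACGTTCGTAC", k=2))
```

Because the score depends only on the rank order of word counts, no alignment and no distributional assumption is needed, which matches the non-parametric framing in the abstract. The word length `k` is fixed by hand here; choosing it automatically is the problem the abstract attributes to K2*, and this sketch does not attempt it.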


Pages: 1682-1689

Page count: 8

