Sequence similarity measures based on bounded hamming distance

被引:20
作者
Apostolico, Alberto [1 ,2 ]
Guerra, Concettina [1 ,2 ]
Landau, Gad M. [3 ,4 ]
Pizzi, Cinzia [5 ]
机构
[1] Georgia Inst Technol, Coll Comp, 801 Atlantic Dr, Atlanta, GA 30318 USA
[2] CNR, Ist Anal Sistemi & Informat, Rome, Italy
[3] Univ Haifa, Dept Comp Sci, IL-31905 Haifa, Israel
[4] NYU, MetroTech Ctr 6, NYU Polytech Sch Engn, Dept Comp Sci & Engn, Brooklyn, NY 11201 USA
[5] Univ Padua, Dipartimento Ingn Informaz, Via Gradenigo 6-A, I-30131 Padua, Italy
基金
美国国家科学基金会; 以色列科学基金会;
关键词
Pattern matching; String comparison; Alignment free distances; Binary string; Longest common substring; Mismatches; RECOGNITION; DISCOVERY; PHYLOGENY;
D O I
10.1016/j.tcs.2016.01.023
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
A growing number of measures of sequence similarity are being based on some underlying notion of relative compressibility. Within this paradigm, similar sequences are expected to share a large number of common substrings, or subsequences, or more complex patterns or motifs, and so on. In this paper, measures of sequence similarity are introduced and studied in which patterns in a pair are considered similar if they coincide up to a preset number of mismatches, that is, within a bounded Hamming distance. It is shown here that for some such measures bounds are achievable that are slightly better than O(n(2)). Preliminary experiments demonstrate the potential applicability to phylogeny and classification of these similarity measures. (C) 2016 Elsevier B.V. All rights reserved.
引用
收藏
页码:76 / 90
页数:15
相关论文
共 30 条
[1]  
[Anonymous], 1972, INFORM THEORY LIVING
[2]   Motif discovery by monotone scores [J].
Apostolico, Alberto ;
Pizzi, Cinzia .
DISCRETE APPLIED MATHEMATICS, 2007, 155 (6-7) :695-706
[3]   Fast gapped variants for Lempel-Ziv-Welch compression [J].
Apostolico, Alberto .
INFORMATION AND COMPUTATION, 2007, 205 (07) :1012-1026
[4]   Alignment Free Sequence Similarity with Bounded Hamming Distance (Extended Abstract) [J].
Apostolico, Alberto ;
Guerra, Concettina ;
Pizzi, Cinzia .
2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014, :183-192
[5]   Sequence similarity by gapped LZW [J].
Apostolico, Alberto ;
Cunial, Fabio .
2011 DATA COMPRESSION CONFERENCE (DCC), 2011, :343-352
[6]   Efficient algorithms for the discovery of gapped factors [J].
Apostolico, Alberto ;
Pizzi, Cinzia ;
Ukkonen, Esko .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2011, 6
[7]   Efficient tools for comparative substring analysis [J].
Apostolico, Alberto ;
Denas, Olgert ;
Dress, Andreas .
JOURNAL OF BIOTECHNOLOGY, 2010, 149 (03) :120-126
[8]   Scoring Unusual Words with Varying Mismatch Errors [J].
Apostolico, Alberto ;
Pizzi, Cinzia .
MATHEMATICS IN COMPUTER SCIENCE, 2008, 1 (04) :639-653
[10]  
Brillouin L, 1971, SCI INFORM THEORY