Improved Algorithms for Finding Edit Distance Based Motifs

被引:0
作者
Pal, Soumitra [1 ]
Rajasekaran, Sanguthevar [1 ]
机构
[1] Univ Connecticut, Comp Sci & Engn, 371 Fairfield Rd, Storrs, CT 06269 USA
来源
PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE | 2015年
关键词
SEARCH;
D O I
暂无
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable. There is a pressing need to develop efficient exact and approximation algorithms to solve this problem. In this paper we present novel algorithms for solving the (l, d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with atmost d substitutions, insertions and deletions. The algorithms for EMS are customarily evaluated on several challenging instances such as (9,2), (11, 3), (13, 4), (15, 5), and so on. The best previously known algorithm, EMS1, solves up to instance (11, 3) in estimated 3 days. Our algorithm is more than 20 times faster than EMS1. For example, our algorithm solves instance (11, 3) in a couple of minutes and instance (14, 3) in a couple of hours. This significant improvement is due to a novel and provably efficient neighborhood generation technique introduced in this paper. Firstly, we show that it is enough to consider the neighbors which are at a distance exactly d from all possible substrings of the input strings. Secondly, we compactly represent the candidate motifs in the neighborhood using wildcard characters. Thirdly, we generate these compact candidate motifs nearly uniquely with very few repetitions. Finally, we use a trie based data structure to efficiently store the candidate motifs and to output the final motifs in a sorted order. We believe that the techniques we introduce in this paper are also applicable to other motif search problems such as the PMS.
引用
收藏
页码:537 / 542
页数:6
相关论文
共 14 条
[1]  
Adebiyi EF, 2002, LECT NOTES COMPUT SC, V2452, P140
[2]  
[Anonymous], P 2 ANN INT C COMP M
[3]  
[Anonymous], 2005, ART COMPUTER PROGRAM
[4]  
Karlin S., 1989, MATH METHODS DNA SEQ
[5]   Distinguishing string selection problems [J].
Lanctot, JK ;
Li, M ;
Ma, B ;
Wang, SJ ;
Zhang, LX .
INFORMATION AND COMPUTATION, 2003, 185 (01) :41-55
[6]  
Nicolae M., 2015, Nature Scientific Reports, V5
[7]   Efficient sequential and parallel algorithms for planted motif search [J].
Nicolae, Marius ;
Rajasekaran, Sanguthevar .
BMC BIOINFORMATICS, 2014, 15
[8]   EMS1: ANELEGANT ALGORITHM FOR EDIT DISTANCE BASED MOTIF SEARCH [J].
Pathak, Sudipta ;
Rajasekaran, Sanguthevar ;
Nicolae, Marius .
INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2013, 24 (04) :473-486
[9]  
Pevzner P., 2000, P 8 INT C INT SYST M, V2000, P269
[10]   High-performance exact algorithms for motif search [J].
Rajasekaran S. ;
Balla S. ;
Huang C.-H. ;
Thapar V. ;
Gryk M. ;
Maciejewski M. ;
Schiller M. .
Journal of Clinical Monitoring and Computing, 2005, 19 (4-5) :319-328