Sequence Comparison Alignment-Free Approach Based on Suffix Tree and L-Words Frequency

被引:9
作者
Soares, Ines [1 ,2 ,3 ]
Goios, Ana [2 ]
Amorim, Antonio [1 ,2 ]
机构
[1] Univ Porto, Fac Ciencias, P-4169 Oporto, Portugal
[2] Univ Porto, Inst Patol & Imunol Mol, P-4200 Oporto, Portugal
[3] Univ Porto, Ctr Matemat, P-4169 Oporto, Portugal
关键词
GENOME;
D O I
10.1100/2012/450124
中图分类号
O [数理科学和化学]; P [天文学、地球科学]; Q [生物科学]; N [自然科学总论];
学科分类号
07 ; 0710 ; 09 ;
摘要
The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by a computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L-L-words-in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures to the word counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python language and is freely available on the web.
引用
收藏
页数:4
相关论文
共 12 条
[1]   Mugsy: fast multiple alignment of closely related whole genomes [J].
Angiuoli, Samuel V. ;
Salzberg, Steven L. .
BIOINFORMATICS, 2011, 27 (03) :334-342
[2]  
[Anonymous], 1997, ACM SIGACT NEWS
[3]   Median networks: Speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA [J].
Bandelt, HJ ;
Macaulay, V ;
Richards, M .
MOLECULAR PHYLOGENETICS AND EVOLUTION, 2000, 16 (01) :8-28
[4]   The Natural History of Class I Primate Alcohol Dehydrogenases Includes Gene Duplication, Gene Loss, and Gene Conversion [J].
Carrigan, Matthew A. ;
Uryasev, Oleg ;
Davis, Ross P. ;
Zhai, LanMin ;
Hurley, Thomas D. ;
Benner, Steven A. .
PLOS ONE, 2012, 7 (07)
[5]   Histogram-based DNA analysis for the visualization of chromosome, genome and species information [J].
Costa, Antonio M. ;
Machado, Jose T. ;
Quelhas, Maria D. .
BIOINFORMATICS, 2011, 27 (09) :1207-1214
[6]   Efficient estimation of pairwise distances between genomes [J].
Domazet-Loso, Mirjana ;
Haubold, Bernhard .
BIOINFORMATICS, 2009, 25 (24) :3221-3227
[7]   MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797
[8]   BFAST: An Alignment Tool for Large Scale Genome Resequencing [J].
Homer, Nils ;
Merriman, Barry ;
Nelson, Stanley F. .
PLOS ONE, 2009, 4 (11) :A95-A106
[9]  
Schulz Marcel H, 2008, Int J Bioinform Res Appl, V4, P81, DOI 10.1504/IJBRA.2008.017165
[10]   Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions [J].
Sims, Gregory E. ;
Jun, Se-Ran ;
Wu, Guohong A. ;
Kim, Sung-Hou .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2009, 106 (08) :2677-2682