WMSA: a novel method for multiple sequence alignment of DNA sequences

Cited by: 15
Authors
Wei, Yanming [1]
Zou, Quan [2, 3]
Tang, Furong [2]
Yu, Liang [1]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Shaanxi, Peoples R China
[2] Univ Elect Sci & Technol China, Yangtze Delta Reg Inst Quzhou, Quzhou 324003, Zhejiang, Peoples R China
[3] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Sichuan, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
CD-HIT; ALGORITHM; PROTEIN; MAFFT;
DOI
10.1093/bioinformatics/btac658
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Multiple sequence alignment (MSA) is a fundamental problem in bioinformatics, and the quality of an alignment affects downstream analysis. MAFFT uses the Fast Fourier Transform to search for homologous segments, uses them as anchors to divide the sequences, and then aligns only the resulting segments, which saves time and memory without overly reducing alignment quality. However, MAFFT becomes slow when the dataset is large.

Results: We developed WMSA, a software tool that uses a divide-and-conquer method to split the sequences into clusters, aligns each cluster into a profile with the center star strategy, and then performs a progressive profile-profile alignment. The alignments are carried out by the compiled algorithms of MAFFT and K-Band, with multithread parallelism. Our method balances time, space and quality, and performs better than MAFFT in test experiments on highly conserved datasets.

Availability and implementation: Source code is freely available at , which is implemented in C/C++ and supported on Linux, and datasets are available at . Supplementary information is available at Bioinformatics online.
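The center star strategy named in the abstract can be illustrated with a minimal Python sketch. It is not WMSA's implementation, which is written in C/C++ and relies on the compiled algorithms of MAFFT and K-Band. The sketch uses a plain Needleman-Wunsch pairwise alignment, and the function names `global_align` and `center_star` and the scoring values are assumptions chosen for illustration. It picks the sequence with the highest total pairwise score as the center, aligns every other sequence to it, and merges the pairwise alignments by keeping the widest gap opened before each center position.

```python
from typing import List, Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> Tuple[str, str, int]:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def center_star(seqs: List[str]) -> List[str]:
    """Align a non-empty list of sequences with the center star heuristic."""
    k = len(seqs)
    # 1. Choose the center: the sequence with the highest total pairwise score.
    total = [0] * k
    for x in range(k):
        for y in range(x + 1, k):
            s = global_align(seqs[x], seqs[y])[2]
            total[x] += s
            total[y] += s
    c = max(range(k), key=lambda x: total[x])
    center = seqs[c]

    # 2. Align every other sequence to the center and record, for each slot
    #    before a center position (plus the trailing slot), the widest gap run.
    pairs = {}
    max_gaps = [0] * (len(center) + 1)
    for idx, s in enumerate(seqs):
        if idx == c:
            continue
        ca, sa, _ = global_align(center, s)
        pairs[idx] = (ca, sa)
        pos, run = 0, 0
        for ch in ca:
            if ch == "-":
                run += 1
            else:
                max_gaps[pos] = max(max_gaps[pos], run)
                pos, run = pos + 1, 0
        max_gaps[pos] = max(max_gaps[pos], run)

    # 3. Rebuild every row so that each slot has the same width in all rows.
    rows = []
    for idx in range(k):
        if idx == c:
            body = "".join("-" * max_gaps[p] + center[p]
                           for p in range(len(center)))
            rows.append(body + "-" * max_gaps[len(center)])
            continue
        ca, sa = pairs[idx]
        parts, ins, pos = [], [], 0
        for cc, sc in zip(ca, sa):
            if cc == "-":
                ins.append(sc)          # residue inserted relative to center
            else:
                parts.append("".join(ins).ljust(max_gaps[pos], "-"))
                parts.append(sc)
                pos, ins = pos + 1, []
        parts.append("".join(ins).ljust(max_gaps[pos], "-"))
        rows.append("".join(parts))
    return rows
```

Calling `center_star(["ACGTACGT", "ACGTCGT", "ACTTACGT"])` returns one gapped row per input sequence, all of equal length. In a pipeline like the one the abstract describes, a step of this kind would run once per cluster, and the resulting profiles would then be merged progressively.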
Pages: 5019-5025
Page count: 7