WMSA: a novel method for multiple sequence alignment of DNA sequences

Cited by: 15
Authors
Wei, Yanming [1]
Zou, Quan [2, 3]
Tang, Furong [2]
Yu, Liang [1]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Shaanxi, Peoples R China
[2] Univ Elect Sci & Technol China, Yangtze Delta Reg Inst Quzhou, Quzhou 324003, Zhejiang, Peoples R China
[3] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Sichuan, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
CD-HIT; ALGORITHM; PROTEIN; MAFFT;
DOI
10.1093/bioinformatics/btac658
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: Multiple sequence alignment (MSA) is a fundamental problem in bioinformatics, and the quality of an alignment affects downstream analysis. MAFFT uses the Fast Fourier Transform to find homologous segments, uses them as anchors to divide the sequences, and then aligns only the resulting segments, which saves time and memory without greatly reducing alignment quality. However, MAFFT becomes slow when the dataset is large.
Results: We developed WMSA, a software tool that uses a divide-and-conquer method to split the sequences into clusters, aligns each cluster into a profile with the center star strategy, and then performs a progressive profile-profile alignment. The alignments are carried out by compiled implementations of the MAFFT and K-Band algorithms with multithreaded parallelism. Our method balances time, space and quality, and it performs better than MAFFT in test experiments on highly conserved datasets.
Availability and implementation: Source code is freely available at , which is implemented in C/C++ and supported on Linux, and datasets are available at . Supplementary information is available at Bioinformatics online.
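The two building blocks named in the abstract, K-Band alignment and the center star strategy, can be illustrated with a minimal sketch. This is not WMSA's implementation: it assumes unit-cost edit distance instead of a DNA scoring scheme with gap penalties, and the function names (`banded_edit_distance`, `edit_distance`, `pick_center`) are hypothetical. The band is doubled until the result is certified, and the center is taken as the sequence with the smallest summed distance to all others.

```python
def banded_edit_distance(a, b, k):
    """Unit-cost edit distance restricted to the band |i - j| <= k.

    Returns None when the band is too narrow to certify the result.
    Illustrative only; real aligners use substitution and gap scores.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the end cell (n, m) lies outside the band
    inf = float("inf")
    # Row 0 of the DP table, limited to the band.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev.get(j - 1, inf) + cost,  # match or substitution
                prev.get(j, inf) + 1,         # deletion from a
                cur.get(j - 1, inf) + 1,      # insertion into a
            )
        prev = cur
    d = prev[m]
    # A path of cost d never leaves the band of width d, so d <= k
    # proves the banded result equals the unrestricted distance.
    return d if d <= k else None


def edit_distance(a, b):
    """Double the band width until the banded result is certified."""
    k = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_edit_distance(a, b, k)
        if d is not None:
            return d
        k *= 2


def pick_center(seqs):
    """Index of the sequence with the smallest total distance to the rest."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)
```

For highly similar sequences the band stays narrow, so each pairwise comparison touches only a thin diagonal strip of the dynamic programming table, which is consistent with the abstract's emphasis on highly conserved datasets.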
Pages: 5019-5025
Page count: 7