A multiple sequence alignment method with sequence vectorization

被引:0
作者
Ji, Guoli [1 ]
Zeng, Yong [1 ]
Yang, Zijiang [2 ]
Ye, Congting [1 ]
Yao, Jingci [1 ]
机构
[1] Xiamen Univ, Dept Automat, Xiamen, Peoples R China
[2] York Univ, Sch Informat Technol, Toronto, ON M3J 2R7, Canada
基金
高等学校博士学科点专项科研基金; 中国国家自然科学基金;
关键词
k-means; LemK_MSA; Lempel-Ziv; Multiple sequence alignment; LEMPEL-ZIV; COMPLEXITY; PROGRAMS; DISTANCE;
D O I
10.1108/EC-01-2013-0026
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Purpose - The time complexity of most multiple sequence alignment algorithm is O(N-2) or O(N-3) (N is the number of sequences). In addition, with the development of biotechnology, the amount of biological sequences grows significantly. The traditional methods have some difficulties in handling large-scale sequence. The proposed Lemk_MSA method aims to reduce the time complexity, especially for large-scale sequences. At the same time, it can keep similar accuracy level compared to the traditional methods. Design/methodology/approach - LemK_MSA converts multiple sequence alignment into corresponding 10D vector alignment by ten types of copy modes based on Lempel-Ziv. Then, it uses k-means algorithm and NJ algorithm to divide the sequences into several groups and calculate guide tree of each group. A complete guide tree for multiple sequence alignment could be constructed by merging guide tree of every group. Moreover, for large-scale multiple sequence, Lemk_MSA proposes a GPU-based parallel way for distance matrix calculation. Findings - Under this approach, the time efficiency to process multiple sequence alignment can be improved. The high-throughput mouse antibody sequences are used to validate the proposed method. Compared to ClustalW, MAFFT and Mbed, LemK_MSA is more than ten times efficient while ensuring the alignment accuracy at the same time. Originality/value - This paper proposes a novel method with sequence vectorization for multiple sequence alignment based on Lempel-Ziv. A GPU-based parallel method has been designed for large-scale distance matrix calculation. It provides a new way for multiple sequence alignment research.
引用
收藏
页码:283 / 296
页数:14
相关论文
共 21 条
[1]   MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797
[2]   PROGRESSIVE SEQUENCE ALIGNMENT AS A PREREQUISITE TO CORRECT PHYLOGENETIC TREES [J].
FENG, DF ;
DOOLITTLE, RF .
JOURNAL OF MOLECULAR EVOLUTION, 1987, 25 (04) :351-360
[3]   SIMULATION OF A ROTATING DEVICE THAT REDUCES THE AERODYNAMIC DRAG OF AN AUTOMOBILE [J].
Gagnon, Louis ;
Richard, Marc J. ;
Levesque, Benoit .
TRANSACTIONS OF THE CANADIAN SOCIETY FOR MECHANICAL ENGINEERING, 2011, 35 (02) :229-249
[4]   A benchmark of multiple sequence alignment programs upon structural RNAs [J].
Gardner, PP ;
Wilm, A ;
Washietl, S .
NUCLEIC ACIDS RESEARCH, 2005, 33 (08) :2433-2439
[5]  
Gordon B., 2010, ALGORITHMS MOL BIOL, V5
[6]  
Hartigan JA, 1979, J R STAT SOC C-APPL, V28, P100
[7]  
Ji GL, 2009, LECT NOTES COMPUT SC, V5863, P151
[8]   MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform [J].
Katoh, K ;
Misawa, K ;
Kuma, K ;
Miyata, T .
NUCLEIC ACIDS RESEARCH, 2002, 30 (14) :3059-3066
[9]   COMPLEXITY OF FINITE SEQUENCES [J].
LEMPEL, A ;
ZIV, J .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1976, 22 (01) :75-81
[10]   NVIDIA Tesla: A unified graphics and computing architecture [J].
Lindholm, Erik ;
Nickolls, John ;
Oberman, Stuart ;
Montrym, John .
IEEE MICRO, 2008, 28 (02) :39-55