A multiple sequence alignment method with sequence vectorization

被引:0
作者
Ji, Guoli [1 ]
Zeng, Yong [1 ]
Yang, Zijiang [2 ]
Ye, Congting [1 ]
Yao, Jingci [1 ]
机构
[1] Xiamen Univ, Dept Automat, Xiamen, Peoples R China
[2] York Univ, Sch Informat Technol, Toronto, ON M3J 2R7, Canada
基金
中国国家自然科学基金; 高等学校博士学科点专项科研基金;
关键词
k-means; LemK_MSA; Lempel-Ziv; Multiple sequence alignment; LEMPEL-ZIV; COMPLEXITY; PROGRAMS; DISTANCE;
D O I
10.1108/EC-01-2013-0026
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Purpose - The time complexity of most multiple sequence alignment algorithm is O(N-2) or O(N-3) (N is the number of sequences). In addition, with the development of biotechnology, the amount of biological sequences grows significantly. The traditional methods have some difficulties in handling large-scale sequence. The proposed Lemk_MSA method aims to reduce the time complexity, especially for large-scale sequences. At the same time, it can keep similar accuracy level compared to the traditional methods. Design/methodology/approach - LemK_MSA converts multiple sequence alignment into corresponding 10D vector alignment by ten types of copy modes based on Lempel-Ziv. Then, it uses k-means algorithm and NJ algorithm to divide the sequences into several groups and calculate guide tree of each group. A complete guide tree for multiple sequence alignment could be constructed by merging guide tree of every group. Moreover, for large-scale multiple sequence, Lemk_MSA proposes a GPU-based parallel way for distance matrix calculation. Findings - Under this approach, the time efficiency to process multiple sequence alignment can be improved. The high-throughput mouse antibody sequences are used to validate the proposed method. Compared to ClustalW, MAFFT and Mbed, LemK_MSA is more than ten times efficient while ensuring the alignment accuracy at the same time. Originality/value - This paper proposes a novel method with sequence vectorization for multiple sequence alignment based on Lempel-Ziv. A GPU-based parallel method has been designed for large-scale distance matrix calculation. It provides a new way for multiple sequence alignment research.
引用
收藏
页码:283 / 296
页数:14
相关论文
共 21 条
[21]   QOMA: quasi-optimal multiple alignment of protein sequences [J].
Zhang, Xu ;
Kahveci, Tamer .
BIOINFORMATICS, 2007, 23 (02) :162-168