Spark-based parallelization of basic local alignment search tool

Cited by: 2
Authors
Wang H. [1, 2]
Li L. [1, 2]
Zhou C. [1, 2]
Lin H. [1, 2]
Deng D. [1, 2]
Affiliations
[1] College of Data Science and Application, Inner Mongolia University of Technology, Hohhot
[2] Inner Mongolia Autonomous Region Engineering and Technology Research Center of Big Data Based Software Service, Inner Mongolia University of Technology, Hohhot
Keywords
Basic local alignment search tool; Parallelization; Sequence alignment; Spark; Speedup
DOI
10.7546/ijba.2020.24.1.000767
Abstract
Sequence alignment is a key step in bioinformatics analysis. The basic local alignment search tool (BLAST) is a popular sequence alignment algorithm with high accuracy. However, BLAST is inefficient at comparing and analyzing massive amounts of gene sequencing data. To solve this problem, this paper designs a distributed parallel BLAST method called SparkBLAST, based on the big data framework Spark. Under Spark's in-memory computing framework, SparkBLAST identifies the sequence alignment task, divides the sequence dataset, and compares the sequence data. Apache Hadoop YARN is adopted for task scheduling and resource allocation. Finally, SparkBLAST was compared with standalone BLAST through experiments. The results show that SparkBLAST achieved a speedup ratio of 3.95 without sacrificing accuracy; that is, SparkBLAST is substantially more efficient than standalone BLAST. The research findings provide bioinformatics researchers with a highly efficient tool for sequence alignment. © 2020 by the authors.
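The divide-and-compare idea in the abstract can be illustrated with a minimal sketch. This is not the authors' SparkBLAST code: it assumes a round-robin split of FASTA records, uses a thread pool as a stand-in for Spark executors, and replaces the per-partition BLAST run with a placeholder exact-substring match. All function names and the sample sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) records from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, n_parts):
    """Divide the records into n_parts partitions, round-robin (an assumption)."""
    return [records[i::n_parts] for i in range(n_parts)]


def search_partition(query, partition):
    """Placeholder for a per-partition BLAST run: exact substring match only."""
    return [header for header, seq in partition if query in seq]


def parallel_search(query, records, n_parts):
    """Search every partition concurrently and merge the hits."""
    parts = split_records(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(lambda part: search_partition(query, part), parts)
    return sorted(hit for part_hits in results for hit in part_hits)


def speedup(t_serial, t_parallel):
    """Speedup ratio: serial run time divided by parallel run time."""
    return t_serial / t_parallel


# Hypothetical toy database, not data from the paper.
fasta = """>seq1
ACGTACGT
>seq2
TTTTGGGG
>seq3
GGACGTAA
"""
records = parse_fasta(fasta)
serial_hits = sorted(search_partition("ACGT", records))
parallel_hits = parallel_search("ACGT", records, n_parts=2)
print(serial_hits == parallel_hits)  # the split must not change the hit set
```

The final comparison mirrors the abstract's claim that parallelization does not sacrifice accuracy: the merged per-partition hits should equal the single-run hits. The `speedup` helper shows how a ratio such as the reported 3.95 is computed from two wall-clock times; the timings themselves would have to come from real runs.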
Pages: 87-98
Page count: 11
Related papers
22 items in total
[1] Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J., Basic Local Alignment Search Tool, Journal of Molecular Biology, 215, 3, pp. 403-410, (1990)
[2] Awan A.J., Brorsson M., Vlassov V., Ayguade E., Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study, (2016)
[3] Bjornson R.D., Sherman A.H., Weston S.B., Willard N., Wing J., TurboBLAST®: A Parallel Implementation of BLAST built on the TurboHub, Proceedings of International Parallel and Distributed Processing Symposium, pp. 1-8, (2002)
[4] BLAST+, (2020)
[5] Huson D.H., Xie C., A Poor Man's BLASTX-High-throughput Metagenomic Protein Database Search Using Pauda, Bioinformatics, 30, 1, pp. 38-39, (2014)
[6] Islam N.S., Wasiur-Rahman M., Lu X., Shankar D., Panda D.K., Performance Characterization and Acceleration of In-memory File Systems for Hadoop and Spark Applications on HPC Clusters, Proc. of the 2015 IEEE Int Conf on Big Data, pp. 243-252, (2015)
[7] Jacob A., Lancaster J., Buhler J., Harris B., Chamberlain R.D., Mercury BLASTP: Accelerating Protein Sequence Alignment, ACM Transactions on Reconfigurable Technology & Systems, 1, 2, pp. 9-16, (2008)
[8] Kent W.J., BLAT-The BLAST-like Alignment Tool, Genome Research, 12, 4, pp. 656-664, (2002)
[9] Khare N., Khare A., Khan F., HCudaBLAST: An Implementation of BLAST on Hadoop and Cuda, Journal of Big Data, 4, 1, (2017)
[10] Kuminsky K., VMware vCenter Cookbook, (2015)