SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

Cited by: 66
Authors
Abuin, Jose M. [1]
Pichel, Juan C. [1]
Pena, Tomas F. [1]
Amigo, Jorge [2, 3]
Affiliations
[1] Univ Santiago de Compostela, Ctr Invest Tecnoloxias Informac CITIUS, Santiago De Compostela, Spain
[2] Fdn Publ Galega Med Xenom SERGAS, Santiago De Compostela, Spain
[3] Inst Invest Sanitaria Santiago de Compostela, Grp Med Xenom, Santiago De Compostela, Spain
Source
PLOS ONE | 2016 / Vol. 11 / Issue 05
Keywords
READ ALIGNMENT; ALIGNER; FORMAT;
DOI
10.1371/journal.pone.0155461
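Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification code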
07 ; 0710 ; 09 ;
Abstract
Next-generation sequencing (NGS) technologies have produced a huge amount of genomic data that needs to be analyzed and interpreted. This has a major impact on the DNA sequence alignment process, which nowadays requires mapping billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have complex implementations. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing noticeable results in terms of performance and scalability. A comparison with other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA under a GPL3 license.
Pages: 21
Related papers
29 in total
  • [1] BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies
    Abuin, Jose M.
    Pichel, Juan C.
    Pena, Tomas F.
    Amigo, Jorge
    [J]. BIOINFORMATICS, 2015, 31 (24) : 4003 - 4005
  • [2] A map of human genome variation from population-scale sequencing
    Altshuler, David
    Durbin, Richard M.
    Abecasis, Goncalo R.
    Bentley, David R.
    Chakravarti, Aravinda
    Clark, Andrew G.
    Collins, Francis S.
    De la Vega, Francisco M.
    Donnelly, Peter
    Egholm, Michael
    Flicek, Paul
    Gabriel, Stacey B.
    Gibbs, Richard A.
    Knoppers, Bartha M.
    Lander, Eric S.
    Lehrach, Hans
    Mardis, Elaine R.
    McVean, Gil A.
    Nickerson, Debbie A.
    Peltonen, Leena
    Schafer, Alan J.
    Sherry, Stephen T.
    Wang, Jun
    Wilson, Richard K.
    Gibbs, Richard A.
    Deiros, David
    Metzker, Mike
    Muzny, Donna
    Reid, Jeff
    Wheeler, David
    Wang, Jun
    Li, Jingxiang
    Jian, Min
    Li, Guoqing
    Li, Ruiqiang
    Liang, Huiqing
    Tian, Geng
    Wang, Bo
    Wang, Jian
    Wang, Wei
    Yang, Huanming
    Zhang, Xiuqing
    Zheng, Huisong
    Lander, Eric S.
    Altshuler, David L.
    Ambrogio, Lauren
    Bloom, Toby
    Cibulskis, Kristian
    Fennell, Tim J.
    Gabriel, Stacey B.
    [J]. NATURE, 2010, 467 (7319) : 1061 - 1073
  • [3] [Anonymous], 2010, Proceedings of the International Symposium on High Performance Distributed Computing (HPDC '10), DOI 10.1145/1851476.1851594
  • [4] [Anonymous], 2012, NSDI
  • [5] [Anonymous], 2011, NSDI, DOI 10.1016/0375-6505(85)90011-2
  • [6] Arram J, 2013, LECT NOTES COMPUT SC, V7806, P13, DOI 10.1007/978-3-642-36812-7_2
  • [7] The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
    Cock, Peter J. A.
    Fields, Christopher J.
    Goto, Naohisa
    Heuer, Michael L.
    Rice, Peter M.
    [J]. NUCLEIC ACIDS RESEARCH, 2010, 38 (06) : 1767 - 1771
  • [8] mBWA: A Massively Parallel Sequence Reads Aligner
    Cui, Yingbo
    Liao, Xiangke
    Zhu, Xiaoqian
    Wang, Bingqiang
    Peng, Shaoliang
    [J]. 8TH INTERNATIONAL CONFERENCE ON PRACTICAL APPLICATIONS OF COMPUTATIONAL BIOLOGY & BIOINFORMATICS (PACBB 2014), 2014, 294 : 113 - 120
  • [9] Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '04), P137
  • [10] Halvade: scalable sequence analysis with MapReduce
    Decap, Dries
    Reumers, Joke
    Herzeel, Charlotte
    Costanza, Pascal
    Fostier, Jan
    [J]. BIOINFORMATICS, 2015, 31 (15) : 2482 - 2488