Halvade: scalable sequence analysis with MapReduce

Cited by: 48
Authors
Decap, Dries [1, 2]
Reumers, Joke [2, 3]
Herzeel, Charlotte [2, 4]
Costanza, Pascal [2, 5]
Fostier, Jan [1, 2]
Affiliations
[1] Ghent Univ iMinds, Dept Informat Technol, B-9050 Ghent, Belgium
[2] ExaSci Life Lab, B-3001 Leuven, Belgium
[3] Janssen Res & Dev, B-2340 Beerse, Belgium
[4] IMEC, B-3001 Leuven, Belgium
[5] Intel Corp Belgium, Louvain, Belgium
Keywords
GENOME ANALYSIS; ALIGNMENT; FRAMEWORK; CLOUD
DOI
10.1093/bioinformatics/btv179
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
Pages: 2482-2488
Page count: 7