Halvade: scalable sequence analysis with MapReduce

Cited by: 48

Authors:

Decap, Dries ^{[1,2]}

Reumers, Joke ^{[2,3]}

Herzeel, Charlotte ^{[2,4]}

Costanza, Pascal ^{[2,5]}

Fostier, Jan ^{[1,2]}

Affiliations:

[1] Ghent Univ iMinds, Dept Informat Technol, B-9050 Ghent, Belgium

[2] ExaSci Life Lab, B-3001 Leuven, Belgium

[3] Janssen Res & Dev, B-2340 Beerse, Belgium

[4] IMEC, B-3001 Leuven, Belgium

[5] Intel Corp Belgium, Louvain, Belgium

Source:

BIOINFORMATICS | 2015 / Vol. 31 / Issue 15

Keywords:

GENOME ANALYSIS; ALIGNMENT; FRAMEWORK; CLOUD;

DOI:

10.1093/bioinformatics/btv179

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.

Citation

Pages: 2482-2488

Page count: 7
