Parallel NGS Assembly Using Distributed Assembly Graphs Enriched with Biological Knowledge

被引:0
作者
Warnke-Sommer, Julia D. [1 ]
Ali, Hesham H. [2 ]
机构
[1] Univ Nebraska, Dept Pathol & Microbiol, Med Ctr, Omaha, NE 68198 USA
[2] Univ Nebraska, Coll Informat Sci & Technol, Omaha, NE 68182 USA
来源
2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2017年
关键词
next generation sequencing; assembly graph; high performance computing; algorithms; SCHEME;
D O I
10.1109/IPDPSW.2017.143
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
High performance computing has become essential for many biomedical applications as the production of biological data continues to increase. Next Generation Sequencing (NGS) technologies are capable of producing millions to even billions of short DNA fragments called reads. These short reads are assembled into larger sequences called contigs by graph theoretic software tools called assemblers. High performance computing has been applied to reduce the computational burden of several steps of the NGS data assembly process. Several parallel de Bruijn graph assemblers rely on a distributed assembly graph. However, the majority of assemblers that utilize distributed assembly graphs do not take the input properties of the data set into consideration to improve the graph partitioning process. Furthermore, the graph theoretic foundation for the majority of these assemblers is a distributed de Bruijn graph. In this paper, we introduce a distributed overlap graph based model upon which our parallel assembler Focus is built. The contribution of this paper is three-fold. First, we demonstrate that the application of data specific knowledge regarding the inherent linearity of DNA sequences can be used to improve the partitioning processes for distributing the assembly graph. Second, we implement several parallel graph algorithms for assembly with greatly improved speedup. Finally, we demonstrate that for metagenomics datasets, the graph partitioning provides insights into the structure of the microbial community.
引用
收藏
页码:273 / 282
页数:10
相关论文
共 18 条
[1]   Ray: Simultaneous Assembly of Reads from a Mix of High-Throughput Sequencing Technologies [J].
Boisvert, Sebastien ;
Laviolette, Francois ;
Corbeil, Jacques .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2010, 17 (11) :1519-1533
[2]  
CHANG YP, 2012, BMC GENOMICS, V13, DOI DOI 10.1186/1471-2369-13-96
[3]  
DUTT ST, 1993, 1993 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN - DIGEST OF TECHNICAL PAPERS, P370, DOI 10.1109/ICCAD.1993.580083
[4]   HipMer: An Extreme-Scale De Novo Genome Assembler [J].
Georganas, Evangelos ;
Buluc, Aydin ;
Chapman, Jarrod ;
Hofmeyr, Steven ;
Aluru, Chaitanya ;
Egan, Rob ;
Oliker, Leonid ;
Rokhsar, Daniel ;
Yelick, Katherine .
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2015,
[5]   PCAP: A whole-genome assembly program [J].
Huang, XQ ;
Wang, JM ;
Aluru, S ;
Yang, SP ;
Hillier, L .
GENOME RESEARCH, 2003, 13 (09) :2164-2170
[6]   A fast and high quality multilevel scheme for partitioning irregular graphs [J].
Karypis, G ;
Kumar, V .
SIAM JOURNAL ON SCIENTIFIC COMPUTING, 1998, 20 (01) :359-392
[7]   Multilevel k-way partitioning scheme for irregular graphs [J].
Karypis, G ;
Kumar, V .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1998, 48 (01) :96-129
[8]  
Kernighan B. W., 1970, Bell System Technical Journal, V49, P291
[9]   Faster suffix sorting [J].
Larsson, N. Jesper ;
Sadakane, Kunihiko .
THEORETICAL COMPUTER SCIENCE, 2007, 387 (03) :258-272
[10]   MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph [J].
Li, Dinghua ;
Liu, Chi-Man ;
Luo, Ruibang ;
Sadakane, Kunihiko ;
Lam, Tak-Wah .
BIOINFORMATICS, 2015, 31 (10) :1674-1676