Using Apache Spark on genome assembly for scalable overlap-graph reduction

被引:3
作者
Paul, Alexander J. [1 ]
Lawrence, Dylan [2 ]
Song, Myoungkyu [3 ]
Lim, Seung-Hwan [4 ]
Pan, Chongle [5 ]
Ahn, Tae-Hyuk [1 ,6 ]
机构
[1] St Louis Univ, Bioinformat & Computat Biol Program, St Louis, MO 63103 USA
[2] Washington Univ, Computat & Syst Biol Program, St Louis, MO 63110 USA
[3] Univ Nebraska, Dept Comp Sci, Omaha, NE 68182 USA
[4] Oak Ridge Natl Lab, Natl Ctr Computat Sci, Oak Ridge, TN USA
[5] Univ Oklahoma, Sch Comp Sci, Norman, OK 73019 USA
[6] St Louis Univ, Dept Comp Sci, St Louis, MO 63103 USA
关键词
Graph reduction; Apache spark; Genome assembly; Cloud computing; Overlap-layout-consensus;
D O I
10.1186/s40246-019-0227-1
中图分类号
Q3 [遗传学];
学科分类号
071007 ; 090102 ;
摘要
Background De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments without additional work with reference sequence. Sequence fragments (called reads) are assembled as contigs and scaffolds by the overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, existing sequencing techniques have been proposed, for example, high-throughput next-generation sequencing and long-reads-producing third-generation sequencing. However, these techniques require a large amounts of computer memory when very huge-size overlap graphs are resolved. Also, it is challenging for parallel computation. Results To address the limitations, we propose an innovative algorithmic approach, called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms by Apache Spark. The SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the number of edges on enormous graphing paths by adapting scalable features of graph processing libraries provided by Apache Spark, GraphX and GraphFrames. Conclusions We shared the algorithms and the experimental results at our project website, . We evaluated SORA with the human genome samples. First, it processed a nearly one billion edge graph on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA achieved the linear-scaling simulations for the increased computing instances.
引用
收藏
页数:12
相关论文
共 27 条
  • [1] Abu-Doleh A, 2015, PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, P1013, DOI 10.1109/BigData.2015.7363853
  • [2] [Anonymous], 2016, The Journal of Machine Learning Research, DOI DOI 10.1145/2882903.2912565
  • [3] Next-generation DNA sequencing techniques
    Ansorge, Wilhelm J.
    [J]. NEW BIOTECHNOLOGY, 2009, 25 (04) : 195 - 203
  • [4] Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
    Berlin, Konstantin
    Koren, Sergey
    Chin, Chen-Shan
    Drake, James P.
    Landolin, Jane M.
    Phillippy, Adam M.
    [J]. NATURE BIOTECHNOLOGY, 2015, 33 (06) : 623 - +
  • [5] Boisvert S, J COMPUT BIOL, V17, P1519
  • [6] Das T., 2012, P 9 USENIX C NETWORK
  • [7] Flicek P, 2009, NAT METHODS, V6, pS6, DOI [10.1038/NMETH.1376, 10.1038/nmeth.1376]
  • [8] Omega: an Overlap-graph de novo Assembler for Metagenomics
    Haider, Bahlul
    Ahn, Tae-Hyuk
    Bushnell, Brian
    Chai, Juanjuan
    Copeland, Alex
    Pan, Chongle
    [J]. BIOINFORMATICS, 2014, 30 (19) : 2717 - 2722
  • [9] Advantages and limitations of next-generation sequencing technologies: A comparison of electrophoresis and non-electrophoresis methods
    Hert, Daniel G.
    Fredlake, Christopher P.
    Barron, Annelise E.
    [J]. ELECTROPHORESIS, 2008, 29 (23) : 4618 - 4626
  • [10] Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
    Koren, Sergey
    Walenz, Brian P.
    Berlin, Konstantin
    Miller, Jason R.
    Bergman, Nicholas H.
    Phillippy, Adam M.
    [J]. GENOME RESEARCH, 2017, 27 (05) : 722 - 736