The present and future of de novo whole-genome assembly

Cited by: 148
Authors
Sohn, Jang-il [1, 2]
Nam, Jin-Wu [1, 3]
Affiliations
[1] Hanyang Univ, FTC 1123,222 Wangsimni Ro, Seoul 04763, South Korea
[2] Korea Univ, Stat Phys, Seoul, South Korea
[3] MIT, Whitehead Inst Biomed Res, Computat Biol, Cambridge, MA 02139 USA
Funding
National Research Foundation of Singapore;
Keywords
de novo assembly algorithms; de Bruijn graph; next-generation sequencing; single-molecule sequencing; HYBRID ERROR-CORRECTION; SEQUENCING DATA; BRUIJN GRAPHS; SINGLE-CELL; DNA; LONG; EFFICIENT; QUALITY; ACCURATE; RETROTRANSPOSONS;
DOI
10.1093/bib/bbw096
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly remain, although many bright ideas and heuristics have been suggested to tackle them in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graph (Hamiltonian or Eulerian) and discuss the challenges of de novo assembly for short NGS reads with regard to computational complexity and assembly ambiguity. Then, we discuss how the limitations of short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads, and discuss their challenges and future prospects. This review provides guidelines for determining the optimal approach for a given input data type, computational budget or genome.
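The Eulerian de Bruijn graph named in the abstract can be illustrated with a minimal Python sketch. This is not code from the review: the function names (`de_bruijn_edges`, `eulerian_path`, `spell`), the toy reads, and the choice of k = 4 are all assumptions made for illustration, and the sketch ignores sequencing errors, reverse complements, and repeats. Each distinct k-mer becomes one edge between its prefix and suffix (k-1)-mers, and a path that uses every edge once spells a candidate sequence. In the Hamiltonian formulation, k-mers would instead be the nodes and the path would visit every node once.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build adjacency lists where each distinct k-mer is one edge
    from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes (without checking) that the
    graph actually has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    adj = {u: list(vs) for u, vs in graph.items()}  # consumable copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj.get(u):
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path


def spell(path):
    """Concatenate the first node with the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads, chosen so that no (k-1)-mer repeats.
reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]
contig = spell(eulerian_path(de_bruijn_edges(reads, k=4)))
print(contig)
```

With real data, repeated (k-1)-mers create branching nodes and more than one Eulerian path, which is one form of the assembly ambiguity the abstract refers to.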
Pages: 23-40
Page count: 18