TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

Cited by: 214
Authors
Xu, Mengyang [1, 2, 3]
Guo, Lidong [1, 4]
Gu, Shengqiang [1, 4]
Wang, Ou [3, 5]
Zhang, Rui [1]
Peters, Brock A. [3, 6]
Fan, Guangyi [1, 3]
Liu, Xin [1, 2, 3, 7]
Xu, Xun [3, 7]
Deng, Li [1, 2, 3]
Zhang, Yongwei [3, 6]
Affiliations
[1] BGI Shenzhen, BGI Qingdao, West Coast New Area, 2 Hengyunshan Rd, Qingdao 266426, Peoples R China
[2] BGI Shenzhen, State Key Lab Agr Genom, Bldg 11, Shenzhen 518083, Peoples R China
[3] BGI Shenzhen, Bldg 11, Shenzhen 518083, Peoples R China
[4] Univ Chinese Acad Sci, BGI Educ Ctr, Bldg 11, Shenzhen 518083, Peoples R China
[5] BGI Shenzhen, MGI, Bldg 11, Shenzhen 518083, Peoples R China
[6] Complete Genom Inc, 2904 Orchard Pkwy, San Jose, CA 95134 USA
[7] BGI Shenzhen, China Natl GeneBank, Jinsha Rd, Shenzhen 518120, Peoples R China
Source
GIGASCIENCE | 2020 / Vol. 9 / Issue 09
Keywords
gap closure; third-generation sequencing; genome assembly; ginkgo; MHC;
DOI
10.1093/gigascience/giaa094
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited. Findings: We developed a software tool to close sequence gaps in genome assemblies, TGS-GapCloser, that uses low-depth (~10x) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error-corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ~10x coverage of Oxford Nanopore or Pacific Biosciences reads, covering up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (~12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data. Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
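The abstract pairs each accuracy figure with a Phred-scaled quality value (99.998% with Q46, 99.97% with Q35). The sketch below shows the standard Phred conversion, Q = -10 * log10(1 - accuracy), applied to those two numbers. The function name `phred_from_accuracy` is hypothetical and is not part of TGS-GapCloser. That the reported Q values are truncated rather than rounded is an assumption made here to reconcile the figures; the record does not say how they were derived.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (0 < accuracy < 1) to a Phred quality score.

    Hypothetical helper for illustration; not taken from TGS-GapCloser.
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


# Accuracy figures quoted in the abstract.
for label, acc in [("assembly", 0.99998), ("inserted sequence", 0.9997)]:
    q = phred_from_accuracy(acc)
    # Assumption: the abstract's Q values correspond to the truncated score.
    print(f"{label}: accuracy={acc} Q={q:.1f} (truncated: Q{math.floor(q)})")
```

Under this formula the two accuracies give scores of roughly 47.0 and 35.2, so the quoted Q46 and Q35 are consistent with truncation, or with unrounded accuracies slightly below the printed percentages.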
Pages: 11