TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

Cited by: 270
Authors
Xu, Mengyang [1, 2, 3]
Guo, Lidong [1, 4]
Gu, Shengqiang [1, 4]
Wang, Ou [3, 5]
Zhang, Rui [1]
Peters, Brock A. [3, 6]
Fan, Guangyi [1, 3]
Liu, Xin [1, 2, 3, 7]
Xu, Xun [3, 7]
Deng, Li [1, 2, 3]
Zhang, Yongwei [3, 6]
Affiliations
[1] BGI Shenzhen, BGI Qingdao, West Coast New Area, 2 Hengyunshan Rd, Qingdao 266426, Peoples R China
[2] BGI Shenzhen, State Key Lab Agr Genom, Bldg 11, Shenzhen 518083, Peoples R China
[3] BGI Shenzhen, Bldg 11, Shenzhen 518083, Peoples R China
[4] Univ Chinese Acad Sci, BGI Educ Ctr, Bldg 11, Shenzhen 518083, Peoples R China
[5] BGI Shenzhen, MGI, Bldg 11, Shenzhen 518083, Peoples R China
[6] Complete Genom Inc, 2904 Orchard Pkwy, San Jose, CA 95134 USA
[7] BGI Shenzhen, China Natl GeneBank, Jinsha Rd, Shenzhen 518120, Peoples R China
Keywords
gap closure; third-generation sequencing; genome assembly; ginkgo; MHC;
DOI
10.1093/gigascience/giaa094
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited. Findings: We developed a software tool, TGS-GapCloser, that closes sequence gaps in genome assemblies using low-depth (~10x) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error-corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ~10x coverage of Oxford Nanopore or Pacific Biosciences reads, covering up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (~12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data. Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
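The two kinds of figures quoted in the abstract, scaftig NG50 and Phred-scaled accuracy (Q values), can be illustrated with a minimal Python sketch. This is not code from TGS-GapCloser; the function names (`phred_from_accuracy`, `scaftigs`, `ng50`) and the toy scaffold are assumptions made only for illustration, and scaftigs are taken here to be the gap-free pieces obtained by splitting a scaffold at runs of N.

```python
import math
import re


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (fraction in 0..1) to a Phred-scaled Q value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)


def scaftigs(scaffold: str) -> list[str]:
    """Split a scaffold into gap-free pieces at runs of N (hypothetical helper)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def ng50(lengths: list[int], genome_size: int) -> int:
    """Largest length L such that pieces of length >= L cover half the genome size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return 0  # the pieces never reach half of the genome size


# Toy scaffold with two gaps (runs of N); assumed genome size of 120 bases.
before = "A" * 60 + "N" * 10 + "C" * 30 + "N" * 5 + "G" * 10
# Same scaffold after the first gap has been filled with sequence.
after = "A" * 60 + "T" * 10 + "C" * 30 + "N" * 5 + "G" * 10

ng50_before = ng50([len(s) for s in scaftigs(before)], genome_size=120)
ng50_after = ng50([len(s) for s in scaftigs(after)], genome_size=120)
q_assembly = phred_from_accuracy(0.99998)
q_inserted = phred_from_accuracy(0.9997)
```

In this toy case, filling one gap merges two scaftigs and raises the scaftig NG50 from 60 to 100. The accuracies 99.998% and 99.97% convert to values just under Q47 and just above Q35, consistent with the abstract's Q46 and Q35 when truncated to integers.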
Pages: 11