TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

Cited by: 214
Authors
Xu, Mengyang [1, 2, 3]
Guo, Lidong [1, 4]
Gu, Shengqiang [1, 4]
Wang, Ou [3, 5]
Zhang, Rui [1]
Peters, Brock A. [3, 6]
Fan, Guangyi [1, 3]
Liu, Xin [1, 2, 3, 7]
Xu, Xun [3, 7]
Deng, Li [1, 2, 3]
Zhang, Yongwei [3, 6]
Affiliations
[1] BGI Shenzhen, BGI Qingdao, West Coast New Area, 2 Hengyunshan Rd, Qingdao 266426, Peoples R China
[2] BGI Shenzhen, State Key Lab Agr Genom, Bldg 11, Shenzhen 518083, Peoples R China
[3] BGI Shenzhen, Bldg 11, Shenzhen 518083, Peoples R China
[4] Univ Chinese Acad Sci, BGI Educ Ctr, Bldg 11, Shenzhen 518083, Peoples R China
[5] BGI Shenzhen, MGI, Bldg 11, Shenzhen 518083, Peoples R China
[6] Complete Genom Inc, 2904 Orchard Pkwy, San Jose, CA 95134 USA
[7] BGI Shenzhen, China Natl GeneBank, Jinsha Rd, Shenzhen 518120, Peoples R China
Source
GIGASCIENCE | 2020 / Vol. 9 / Issue 09
Keywords
gap closure; third-generation sequencing; genome assembly; ginkgo; MHC
DOI
10.1093/gigascience/giaa094
Chinese Library Classification number
Q [Biological sciences]
Discipline classification code
07; 0710; 09
Abstract
Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited. Findings: We developed a software tool to close sequence gaps in genome assemblies, TGS-GapCloser, that uses low-depth (~10x) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ~10x coverage of Oxford Nanopore or Pacific Biosciences reads, covering up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (~12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data. Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
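The abstract reports its contiguity gains as scaftig NG50. As a rough illustration of that metric only (not of TGS-GapCloser itself), the sketch below splits scaffolds into scaftigs at runs of N and computes NG50 before and after one gap is filled. The function names `split_scaftigs` and `ng50`, the toy sequences, and the assumed genome size of 32 bp are all hypothetical and chosen for the example.

```python
import re


def split_scaftigs(scaffold):
    """Split a scaffold into gap-free pieces (scaftigs) at runs of N/n."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def ng50(lengths, genome_size):
    """Length of the shortest piece in the smallest set of longest pieces
    that together cover at least half of genome_size (0 if never reached)."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


# Toy assembly (hypothetical data): two scaffolds, each with one gap.
scaffolds = ["ACGTACGTNNNNACGT", "ACGTACNNACGTACGTAC"]
# Same assembly after the first gap has been filled with sequence.
closed = ["ACGTACGTTTGAACGT", "ACGTACNNACGTACGTAC"]

GENOME_SIZE = 32  # assumed estimated genome size for the toy example

before = [len(s) for sc in scaffolds for s in split_scaftigs(sc)]
after = [len(s) for sc in closed for s in split_scaftigs(sc)]

print("scaftig NG50 before:", ng50(before, GENOME_SIZE))
print("scaftig NG50 after: ", ng50(after, GENOME_SIZE))
```

Filling a gap merges two scaftigs into one longer piece, which is why gap closing raises scaftig NG50 even though the scaffold lengths themselves do not change.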
Pages: 11