Discovery and genotyping of structural variation from long-read haploid genome sequence data

Cited by: 236
Authors
Huddleston, John [1, 2]
Chaisson, Mark J. P. [1]
Steinberg, Karyn Meltz [3]
Warren, Wes [3]
Hoekzema, Kendra [1]
Gordon, David [1, 2]
Graves-Lindsay, Tina A. [3]
Munson, Katherine M. [1]
Kronenberg, Zev N. [1]
Vives, Laura [1]
Peluso, Paul [4]
Boitano, Matthew [4]
Chin, Chen-Shin [4]
Korlach, Jonas [4]
Wilson, Richard K. [5]
Eichler, Evan E. [1, 2]
Affiliations
[1] Univ Washington, Sch Med, Dept Genome Sci, Seattle, WA 98195 USA
[2] Univ Washington, Howard Hughes Med Inst, Seattle, WA 98195 USA
[3] Washington Univ, Sch Med, McDonnell Genome Inst, Dept Med,Dept Genet, St Louis, MO 63108 USA
[4] Pacific Biosci Calif Inc, Menlo Pk, CA 94025 USA
[5] Univ Pittsburgh, Dept Pathol, Pittsburgh, PA 15261 USA
Funding
National Institutes of Health (USA);
Keywords
COPY NUMBER VARIATION; FRAMEWORK; RESOURCE; ORIGIN;
DOI
10.1101/gr.214007.116
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels, resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF >1%). We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
Pages: 677-685
Page count: 9