Comparative genome analysis using sample-specific string detection in accurate long reads

Cited by: 6
Authors
Khorsand, Parsoa [1]
Denti, Luca [2]
Bonizzoni, Paola [3]
Chikhi, Rayan [2]
Hormozdiari, Fereydoun [1, 4, 5]
Affiliations
[1] Univ Calif Davis, Genome Ctr, Davis, CA 95616 USA
[2] Inst Pasteur, Dept Computat Biol, F-75015 Paris, France
[3] Univ Milano Bicocca, Dept Informat Syst & Commun, I-20126 Milan, Italy
[4] Univ Calif Davis, MIND Inst, Sacramento, CA 95817 USA
[5] Univ Calif Davis, Dept Biochem & Mol Med, Sacramento, CA 95817 USA
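Source
BIOINFORMATICS ADVANCES | 2021 / Vol. 1 / Issue 01
Funding
EU Horizon 2020;
Keywords
DE-NOVO; DIVERSITY;
DOI
10.1093/bioadv/vbab005
Chinese Library Classification (CLC) number
Q [Biological sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract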
Motivation: Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and the diagnosis of rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can study repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants).
Results: We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('sample-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach makes it possible to perform comparative genome analysis without mapping the reads, and it is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach accurately finds sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data).
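The idea of a sample-specific string can be illustrated with a small brute-force sketch. It is not the method developed in the paper, whose algorithm the abstract does not describe; it only shows the definition in action on toy reads. The function names, the `max_len` cap, and the minimality rule (a string is reported only if it is absent from the background sample while its shorter prefix and suffix are present) are assumptions made for illustration, and reverse complements and sequencing errors are ignored.

```python
def substrings_of(reads, max_len):
    """Collect every substring of length <= max_len from a list of reads."""
    seen = set()
    for read in reads:
        for i in range(len(read)):
            for j in range(i + 1, min(i + max_len, len(read)) + 1):
                seen.add(read[i:j])
    return seen


def sample_specific_strings(target_reads, background_reads, max_len=8):
    """Return minimal strings found in the target reads but not in the background.

    Hypothetical helper: a string is kept when it is absent from the
    background reads while both its prefix and its suffix (one character
    shorter) are present there, so no shorter specific string is inside it.
    """
    background = substrings_of(background_reads, max_len)
    specific = set()
    for read in target_reads:
        for i in range(len(read)):
            for j in range(i + 1, min(i + max_len, len(read)) + 1):
                candidate = read[i:j]
                if candidate in background:
                    continue  # still shared with the background, extend it
                # First absent extension from position i: its prefix is shared
                # by construction, so only the suffix needs to be checked.
                suffix = candidate[1:]
                if suffix == "" or suffix in background:
                    specific.add(candidate)
                break
    return specific


if __name__ == "__main__":
    # Toy reads that differ by a single substitution (A -> T at position 4).
    print(sorted(sample_specific_strings(["ACGTTGCA"], ["ACGTAGCA"])))
```

On the toy input the two reported strings, `TG` and `TT`, both overlap the substituted base, which is the sense in which such strings point to variation without any read mapping. The quadratic enumeration here is only workable for tiny inputs.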
Pages: 9