Comparing methods for constructing and representing human pangenome graphs

Cited by: 21
Authors
Andreace, Francesco [1, 2]
Lechat, Pierre [3]
Dufresne, Yoann [1, 3]
Chikhi, Rayan [1]
Affiliations
[1] Univ Paris Cite, Inst Pasteur, Dept Computat Biol, F-75015 Paris, France
[2] Sorbonne Univ, Coll Doctoral, F-75005 Paris, France
[3] Univ Paris, Inst Pasteur, Bioinformat & Biostat Hub, F-75015 Paris, France
Keywords
Pangenomics; de Bruijn graphs; Variation graphs; Sequence analysis; Algorithms; GENOME; EFFICIENT
DOI
10.1186/s13059-023-03098-2
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Background: As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs.
Results: In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci.
Conclusion: This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.
Pages: 19
Related papers
36 in total
[1]  
Andreace F, 2023, Zenodo source code, DOI 10.5281/zenodo.8370336
[2]  
Andreace F, 2023, GitHub source code
[3]   Progressive Cactus is a multiple-genome aligner for the thousand-genome era [J].
Armstrong, Joel ;
Hickey, Glenn ;
Diekhans, Mark ;
Fiddes, Ian T. ;
Novak, Adam M. ;
Deran, Alden ;
Fang, Qi ;
Xie, Duo ;
Feng, Shaohong ;
Stiller, Josefin ;
Genereux, Diane ;
Johnson, Jeremy ;
Marinescu, Voichita Dana ;
Alfoldi, Jessica ;
Harris, Robert S. ;
Lindblad-Toh, Kerstin ;
Haussler, David ;
Karlsson, Elinor ;
Jarvis, Erich D. ;
Zhang, Guojie ;
Paten, Benedict .
NATURE, 2020, 587 (7833) :246-+
[4]  
Baid G, 2023, Dataset. Google Brain Assemblies
[5]   DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer [J].
Baid, Gunjan ;
Cook, Daniel E. ;
Shafin, Kishwar ;
Yun, Taedong ;
Llinares-Lopez, Felipe ;
Berthet, Quentin ;
Belyaeva, Anastasiya ;
Topfer, Armin ;
Wenger, Aaron M. ;
Rowell, William J. ;
Yang, Howard ;
Kolesnikov, Alexey ;
Ammar, Waleed ;
Vert, Jean-Philippe ;
Vaswani, Ashish ;
McLean, Cory Y. ;
Nattestad, Maria ;
Chang, Pi-Chuan ;
Carroll, Andrew .
NATURE BIOTECHNOLOGY, 2023, 41 (02) :232-+
[6]  
Chin CS, 2022, bioRxiv, DOI 10.1101/2022.06.08.495395
[7]   HLA variation and disease [J].
Dendrou, Calliope A. ;
Petersen, Jan ;
Rossjohn, Jamie ;
Fugger, Lars .
NATURE REVIEWS IMMUNOLOGY, 2018, 18 (05) :325-339
[8]  
Doerr D, 2021, GFAffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure
[9]   Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes [J].
Ebler, Jana ;
Ebert, Peter ;
Clarke, Wayne E. ;
Rausch, Tobias ;
Audano, Peter A. ;
Houwaart, Torsten ;
Mao, Yafei ;
Korbel, Jan O. ;
Eichler, Evan E. ;
Zody, Michael C. ;
Dilthey, Alexander T. ;
Marschall, Tobias .
NATURE GENETICS, 2022, 54 (04) :518-+
[10]   Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer [J].
Ekim, Baris ;
Berger, Bonnie ;
Chikhi, Rayan .
CELL SYSTEMS, 2021, 12 (10) :958-+