Haplotype-aware Variant Selection for Genome Graphs

被引:1
作者
Tavakoli, Neda [1 ]
Gibney, Daniel [1 ]
Aluru, Srinivas [1 ]
机构
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA 30332 USA
来源
13TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND HEALTH INFORMATICS, BCB 2022 | 2022年
基金
美国国家科学基金会;
关键词
Variation graphs; variant selection; haplotype-aware; SNPs; ILP-based optimization; FRAMEWORK; ALGORITHM; ALIGNMENT;
D O I
10.1145/3535508.3545556
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Graph-based genome representations have proven to be a powerful tool in genomic analysis due to their ability to encode variations found in multiple haplotypes and capture population genetic diversity. Such graphs also unavoidably contain paths which switch between haplotypes (i.e., recombinant paths) and thus do not fully match any of the constituent haplotypes. The number of such recombinant paths increases combinatorially with path length and cause inefficiencies and false positives when mapping reads. In this paper, we study the problem of finding reduced haplotypeaware genome graphs that incorporate only a selected subset of variants, yet contain paths corresponding to all alpha -long substrings of the input haplotypes (i.e., non-recombinant paths) with at most delta mismatches. Solving this problem optimally, i.e., minimizing the number of variants selected, is previously known to be NP-hard [14]. Here, we first establish several inapproximability results regarding finding haplotype-aware reduced variation graphs of optimal size. We then present an integer linear programming (ILP) formulation for solving the problem, and experimentally demonstrate this is a computationally feasible approach for real-world problems and provides far superior reduction compared to prior approaches.
引用
收藏
页数:9
相关论文
共 38 条
  • [1] A global reference for human genetic variation
    Altshuler, David M.
    Durbin, Richard M.
    Abecasis, Goncalo R.
    Bentley, David R.
    Chakravarti, Aravinda
    Clark, Andrew G.
    Donnelly, Peter
    Eichler, Evan E.
    Flicek, Paul
    Gabriel, Stacey B.
    Gibbs, Richard A.
    Green, Eric D.
    Hurles, Matthew E.
    Knoppers, Bartha M.
    Korbel, Jan O.
    Lander, Eric S.
    Lee, Charles
    Lehrach, Hans
    Mardis, Elaine R.
    Marth, Gabor T.
    McVean, Gil A.
    Nickerson, Deborah A.
    Wang, Jun
    Wilson, Richard K.
    Boerwinkle, Eric
    Doddapaneni, Harsha
    Han, Yi
    Korchina, Viktoriya
    Kovar, Christie
    Lee, Sandra
    Muzny, Donna
    Reid, Jeffrey G.
    Zhu, Yiming
    Chang, Yuqi
    Feng, Qiang
    Fang, Xiaodong
    Guo, Xiaosen
    Jian, Min
    Jiang, Hui
    Jin, Xin
    Lan, Tianming
    Li, Guoqing
    Li, Jingxiang
    Li, Yingrui
    Liu, Shengmao
    Liu, Xiao
    Lu, Yao
    Ma, Xuedi
    Tang, Meifang
    Wang, Bo
    [J]. NATURE, 2015, 526 (7571) : 68 - +
  • [2] Distance indexing and seed clustering in sequence graphs
    Chang, Xian
    Eizenga, Jordan
    Novak, Adam M.
    Siren, Jouni
    Paten, Benedict
    [J]. BIOINFORMATICS, 2020, 36 : 146 - 153
  • [3] The variant call format and VCFtools
    Danecek, Petr
    Auton, Adam
    Abecasis, Goncalo
    Albers, Cornelis A.
    Banks, Eric
    DePristo, Mark A.
    Handsaker, Robert E.
    Lunter, Gerton
    Marth, Gabor T.
    Sherry, Stephen T.
    McVean, Gilean
    Durbin, Richard
    [J]. BIOINFORMATICS, 2011, 27 (15) : 2156 - 2158
  • [4] Vargas: heuristic-free alignment for assessing linear and graph read aligners
    Darby, Charlotte A.
    Gaddipati, Ravi
    Schatz, Michael C.
    Langmead, Ben
    [J]. BIOINFORMATICS, 2020, 36 (12) : 3712 - 3718
  • [5] A new multilayered PCP and the hardness of hypergraph vertex cover
    Dinur, I
    Guruswami, V
    Khot, S
    Regev, O
    [J]. SIAM JOURNAL ON COMPUTING, 2005, 34 (05) : 1129 - 1146
  • [6] Graphtyper enables population-scale genotyping using pangenome graphs
    Eggertsson, Hannes P.
    Jonsson, Hakon
    Kristmundsdottir, Snaedis
    Hjartarson, Eirikur
    Kehr, Birte
    Masson, Gisli
    Zink, Florian
    Hjorleifsson, Kristjan E.
    Jonasdottir, Aslaug
    Jonasdottir, Adalbjorg
    Jonsdottir, Ingileif
    Gudbjartsson, Daniel F.
    Melsted, Pall
    Stefansson, Kari
    Halldorsson, Bjarni V.
    [J]. NATURE GENETICS, 2017, 49 (11) : 1654 - +
  • [7] Pangenome Graphs
    Eizenga, Jordan M.
    Novak, Adam M.
    Sibbesen, Jonas A.
    Heumos, Simon
    Ghaffaari, Ali
    Hickey, Glenn
    Chang, Xian
    Seaman, Josiah D.
    Rounthwaite, Robin
    Ebler, Jana
    Rautiainen, Mikko
    Garg, Shilpa
    Paten, Benedict
    Marschall, Tobias
    Siren, Jouni
    Garrison, Erik
    [J]. ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS, VOL 21, 2020, 2020, 21 : 139 - 162
  • [8] A threshold of in n for approximating set cover
    Feige, U
    [J]. JOURNAL OF THE ACM, 1998, 45 (04) : 634 - 652
  • [9] Variation graph toolkit improves read mapping by representing genetic variation in the reference
    Garrison, Erik
    Siren, Jouni
    Novak, Adam M.
    Hickey, Glenn
    Eizenga, Jordan M.
    Dawson, Eric T.
    Jones, William
    Garg, Shilpa
    Markello, Charles
    Lin, Michael F.
    Paten, Benedict
    Durbin, Richard
    [J]. NATURE BIOTECHNOLOGY, 2018, 36 (09) : 875 - +
  • [10] Fully-sensitive seed finding in sequence graphs using a hybrid index
    Ghaffaari, Ali
    Marschall, Tobias
    [J]. BIOINFORMATICS, 2019, 35 (14) : I81 - I89