Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic

Cited by: 5
Authors
Shaw, Jim [1]
Yu, Yun William [1, 2]
Affiliations
[1] Univ Toronto, Dept Math, Toronto, ON M5S 2E4, Canada
[2] Univ Toronto Scarborough, Comp & Math Sci, Toronto, ON M1C 1A4, Canada
Keywords
READ ALIGNMENT; ALGORITHMS; SEARCH;
DOI
10.1101/gr.277637.122
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ~n that is indexed (or seeded) and a mutated substring of length ~m <= n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(m n^{f(θ)} log n), where f(θ) < 2.43·θ holds as a loose bound. The alignment also turns out to be good; we prove that more than 1 - O(1/√m) fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.
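The seed and chain stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the constant `c` in `choose_k`, the unit score per anchor, and the gap penalty (proportional to the diagonal shift between consecutive anchors) are all assumptions made for illustration. The chaining here is a plain quadratic-time dynamic program, and the extend stage (aligning the gaps between chained anchors) is omitted.

```python
import math
import random


def choose_k(n, c=3.0):
    """Pick a k-mer size that grows like log n (assumed constant c, log base 4)."""
    return max(1, math.ceil(c * math.log(n, 4)))


def find_anchors(ref, query, k):
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    anchors = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain_anchors(anchors, gap_cost=0.5):
    """Chain: best-scoring increasing chain under an assumed linear gap cost.

    Each anchor scores 1. Linking (ia, ja) -> (ib, jb) requires ia < ib and
    ja < jb and costs gap_cost * |(ib - ia) - (jb - ja)|.
    Returns (score, list of anchors in the chain).
    """
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors)
    n = len(anchors)
    best = [1.0] * n   # best chain score ending at each anchor
    prev = [-1] * n    # back-pointer to the previous anchor in that chain
    for b in range(n):
        ib, jb = anchors[b]
        for a in range(b):
            ia, ja = anchors[a]
            if ia < ib and ja < jb:
                penalty = gap_cost * abs((ib - ia) - (jb - ja))
                cand = best[a] + 1.0 - penalty
                if cand > best[b]:
                    best[b] = cand
                    prev[b] = a
    end = max(range(n), key=best.__getitem__)
    path = []
    cur = end
    while cur != -1:
        path.append(anchors[cur])
        cur = prev[cur]
    path.reverse()
    return best[end], path


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "".join(rng.choice("ACGT") for _ in range(300))
    query = ref[100:160]  # unmutated substring, for simplicity
    k = choose_k(len(ref))
    score, path = chain_anchors(find_anchors(ref, query, k))
    print(k, score, len(path))
```

With an unmutated query, every k-mer of the query matches on a single diagonal, so the best chain uses all of them with no gap penalty. With mutations, some k-mers are lost and the chain has gaps, which is the case the extend stage handles.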
Pages: 1175-1187
Page count: 13