Sketching and Sequence Alignment: A Rate-Distortion Perspective

被引:1
作者
Shomorony, Ilan [1 ]
Kamath, Govinda M. [2 ]
机构
[1] Univ Illinois, Urbana, IL 61801 USA
[2] Microsoft Res New England, Cambridge, MA USA
来源
2021 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT) | 2021年
关键词
SIDE INFORMATION;
D O I
10.1109/ISIT45174.2021.9518131
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length n and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require B = (1/D) . O (log n) bits to achieve a distortion D, our proposed approach only requires B = log(2) (1/D) . O(1) bits. This can lead to significant computational savings in pairwise alignment estimation.
引用
收藏
页码:3308 / 3313
页数:6
相关论文
共 24 条
[1]   HYPOTHESIS-TESTING WITH COMMUNICATION CONSTRAINTS [J].
AHLSWEDE, R ;
CSISZAR, I .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1986, 32 (04) :533-542
[2]   SOURCE CODING WITH SIDE INFORMATION AND A CONVERSE FOR DEGRADED BROADCAST CHANNELS [J].
AHLSWEDE, RF ;
KORNER, J .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1975, 21 (06) :629-637
[3]   Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments [J].
Baharav, Tavor Z. ;
Kamath, Govinda M. ;
Tse, David N. ;
Shomorony, Ilan .
PATTERNS, 2020, 1 (06)
[4]   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing [J].
Berlin, Konstantin ;
Koren, Sergey ;
Chin, Chen-Shan ;
Drake, James P. ;
Landolin, Jane M. ;
Phillippy, Adam M. .
NATURE BIOTECHNOLOGY, 2015, 33 (06) :623-+
[5]   Syntactic clustering of the Web [J].
Broder, AZ ;
Glassman, SC ;
Manasse, MS ;
Zweig, G .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1997, 29 (8-13) :1157-1166
[6]  
Han TS, 1998, IEEE T INFORM THEORY, V44, P2300, DOI 10.1109/18.720540
[7]  
Indyk P., 2007, GRADUATE COURSE NOTE, V33, P617
[8]   A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases [J].
Jain, Chirag ;
Dilthey, Alexander ;
Koren, Sergey ;
Aluru, Srinivas ;
Phillippy, Adam M. .
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017, 2017, 10229 :66-81
[9]  
Kamath G. M., 2020, ADV NEUR IN
[10]   HINGE: long-read assembly achieves optimal repeat resolution [J].
Kamath, Govinda M. ;
Shomorony, Ilan ;
Xia, Fei ;
Courtade, Thomas A. ;
Tse, David N. .
GENOME RESEARCH, 2017, 27 (05) :747-756