Fast and memory efficient approach for mapping NGS reads to a reference genome

被引:16
作者
Kumar, Sanjeev [1 ]
Agarwal, Suneeta [1 ]
Ranvijay [1 ]
机构
[1] NIT Allahabad, CSED, Allahabad 211004, Uttar Pradesh, India
关键词
Indexing; read alignment; burrows wheeler transform; wavelet tree; suffix array; genome; ALIGNMENT; ACCURATE;
D O I
10.1142/S0219720019500082
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
New generation sequencing machines: Illumina and Solexa can generate millions of short reads from a given genome sequence on a single run. Alignment of these reads to a reference genome is a core step in Next-generation sequencing data analysis such as genetic variation and genome resequencing etc. Therefore there is a need of a new approach, efficient with respect to memory as well as time to align these enormous reads with the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge memory for whole reference genome indexing and reads alignment. Gapped alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach: WIT for reference genome indexing and reads alignment using Burrows- Wheeler Transform (BWT) and Wavelet Tree (WT) is proposed. Both exact and approximate alignments are possible by it. Experimental work shows that the proposed approach WIT performs the best in case of protein sequence indexing. For indexing, the reference genome space required by WIT is 0.6 N (N is the size of reference genome) whereas existing techniques BWA, Subread, Kart, and Minimap2 require space in between 1.25 N to 5 N. Experimentally, it is also observed that even using such small index size alignment time of proposed approach is comparable in comparison to BWA, Subread, Kart, and Minimap2. Other alignment parameters accuracy and confidentiality are also experimentally shown to be better than Minimap2. The source code of the proposed approach WIT is available at http://www.algorithm-skg.com/wit/home.html.
引用
收藏
页数:17
相关论文
共 22 条
[1]  
Adjeroh D., 2008, The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
[2]  
[Anonymous], BIOINFORMATICS
[3]  
Burrows M, 1994, SRC RES REPORT, V124
[4]   An accurate and fast alignment-free method for profiling microbial communities [J].
Diem-Trang Pham ;
Gao, Shanshan ;
Vinhthuy Phan .
JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2017, 15 (03)
[5]  
Gog S, INT S EXP ALG, P326
[6]  
Grossi R, 2003, SIAM PROC S, P841
[7]   Short read alignment with populations of genomes [J].
Huang, Lin ;
Popic, Victoria ;
Batzoglou, Serafim .
BIOINFORMATICS, 2013, 29 (13) :361-370
[8]   CS2A: a compressed suffix array-based method for short read alignment [J].
Huo, Hongwei ;
Sun, Zhigang ;
Li, Shuangjiang ;
Vitter, Jeffrey Scott ;
Wang, Xinkun ;
Yu, Qiang ;
Huan, Jun .
2016 DATA COMPRESSION CONFERENCE (DCC), 2016, :271-278
[9]   Approximate string matching using compressed suffix arrays [J].
Huynh, TND ;
Hon, WK ;
Lam, TW ;
Sung, WK .
THEORETICAL COMPUTER SCIENCE, 2006, 352 (1-3) :240-249
[10]   WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams [J].
Kumar, Sanjeev ;
Agarwal, Suneeta ;
Ranvijay .
JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2018, 16 (05)