Mapping RNA-seq reads to transcriptomes efficiently based on learning to hash method

Cited by: 6
Authors
Yu, Xueting [1, 2]
Liu, Xuejun [1, 2]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, MIIT Key Lab Pattern Anal & Machine Intelligence, Nanjing 211106, Peoples R China
[2] Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210023, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Read mapping; Learning to hash; Bit-mapping; RNA-seq; Transcriptome; ALIGNMENT; QUANTIFICATION;
DOI
10.1016/j.compbiomed.2019.103539
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Accurate and efficient read alignment is one of the fundamental challenges in RNA-seq analysis. Because RNA-seq experiments generate an increasingly large number of reads, read alignment is time-consuming. Many mappers adopt various strategies to find potential alignment locations for reads in a tolerable time and to provide adequate information for downstream analysis. However, for some transcript analysis tasks, such as transcriptome quantification, it is sufficient to know which transcripts and positions each read maps to. The original alignment problem can therefore be simplified to a string-searching problem, since reads can be mapped contiguously to the transcriptome. Some models for transcript analysis adopt more efficient strategies to solve this simplified problem, but their efficiency is still limited by handling RNA-seq data in the original read space. We propose bit-mapping, a method based on a learning-to-hash algorithm for mapping reads to the transcriptome. It learns hash functions from the transcriptome, generates binary hash codes for the sequences, and then maps reads to the transcriptome according to their hash codes. Bit-mapping accelerates mapping in RNA-seq analysis by reducing the dimensionality of the reads. We evaluate bit-mapping on simulated and real data and compare it with popular and state-of-the-art methods: STAR, RapMap, Bowtie 2 and HISAT 2. The results on both simulated and real data show that our method is competitive with existing mappers in accuracy and mapping efficiency, especially for longer reads (> 100 bp).
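The pipeline described in the abstract (learn hash functions from the transcriptome, encode sequences as binary codes, look reads up by code) can be illustrated with a minimal sketch. This is not the authors' bit-mapping algorithm: the abstract does not specify how the hash functions are learned, so the sketch substitutes a simple data-dependent scheme, random projections of k-mer count profiles with thresholds set to the median over transcriptome windows. All names here (`kmer_profile`, `learn_hash_functions`, `build_index`, `map_read`), the k-mer size, the code length and the toy transcripts are assumptions made for illustration.

```python
import random
from collections import defaultdict
from itertools import product
from statistics import median

K = 2  # k-mer size for the feature vector (illustrative assumption)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq):
    """Count overlapping k-mers; k-mers with non-ACGT symbols are skipped."""
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    return counts


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def windows(transcripts, length):
    """Yield (transcript name, position, substring) for every window."""
    for name, seq in transcripts.items():
        for pos in range(len(seq) - length + 1):
            yield name, pos, seq[pos:pos + length]


def learn_hash_functions(transcripts, read_len, n_bits=16, seed=0):
    """Stand-in for hash learning: random directions, median thresholds.

    The thresholds are fitted to the transcriptome so that each bit splits
    the reference windows roughly in half.
    """
    rng = random.Random(seed)
    directions = [[rng.gauss(0, 1) for _ in KMERS] for _ in range(n_bits)]
    profiles = [kmer_profile(w) for _, _, w in windows(transcripts, read_len)]
    thresholds = [median(dot(d, p) for p in profiles) for d in directions]
    return directions, thresholds


def hash_code(seq, hash_functions):
    """Pack one bit per hash function into an integer code."""
    directions, thresholds = hash_functions
    profile = kmer_profile(seq)
    code = 0
    for d, t in zip(directions, thresholds):
        code = (code << 1) | int(dot(d, profile) > t)
    return code


def build_index(transcripts, read_len, hash_functions):
    """Map each binary code to the transcriptome windows that produce it."""
    index = defaultdict(list)
    for name, pos, window in windows(transcripts, read_len):
        index[hash_code(window, hash_functions)].append((name, pos))
    return index


def map_read(read, transcripts, index, hash_functions):
    """Look up candidates by code, then keep only exact string matches."""
    hits = []
    for name, pos in index.get(hash_code(read, hash_functions), []):
        if transcripts[name][pos:pos + len(read)] == read:
            hits.append((name, pos))
    return hits


# Toy transcriptome (made-up sequences) and a read copied from tx1.
transcripts = {
    "tx1": "ACGTTGCAAGGCTTAACCGGTTAGCATCGATTACG",
    "tx2": "TTGACCATGGCATTCGAAGCTTGGATCCAAGT",
}
READ_LEN = 12
hash_functions = learn_hash_functions(transcripts, READ_LEN)
index = build_index(transcripts, READ_LEN, hash_functions)
read = transcripts["tx1"][5:5 + READ_LEN]
print(map_read(read, transcripts, index, hash_functions))
```

The sketch only retrieves windows whose code equals the read's code and then verifies them by exact comparison, so a read with sequencing errors could be missed. A more realistic variant would probably search codes within a small Hamming distance and tolerate mismatches during verification.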
Pages: 10