SEED: efficient clustering of next-generation sequences

Cited by: 43
Authors
Bao, Ergude [2]
Jiang, Tao [2]
Kaloshian, Isgouhi [3]
Girke, Thomas [1]
Affiliations
[1] Univ Calif Riverside, Dept Bot & Plant Sci, Riverside, CA 92521 USA
[2] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[3] Univ Calif Riverside, Dept Nematol, Riverside, CA 92521 USA
Funding
US National Science Foundation;
Keywords
GENOME; PROGRAM; PROTEIN; SEARCH; FORMAT; FASTER; RNAS; TOOL;
DOI
10.1093/bioinformatics/btr447
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in < 4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
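
The clustering idea in the abstract (index reads in hash tables, pick a center, collect neighbors within a mismatch limit) can be illustrated with a minimal Python sketch. This is not the SEED implementation: it assumes equal-length reads, ignores overhanging residues, uses the first unassigned read as the center rather than a virtual center, and replaces block spaced seeds with a simpler pigeonhole scheme (two reads with at most k mismatches must share at least one of k + 1 contiguous blocks exactly). The names `blocks`, `hamming`, and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def blocks(read, n_blocks):
    """Split a read into n_blocks contiguous pieces; the last takes the remainder."""
    size = len(read) // n_blocks
    pieces = [read[i * size:(i + 1) * size] for i in range(n_blocks - 1)]
    pieces.append(read[(n_blocks - 1) * size:])
    return pieces


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering: each unassigned read becomes a center and absorbs
    all unassigned reads within max_mismatches of it (assumption: reads are
    equal length and at least max_mismatches + 1 bases long)."""
    n_blocks = max_mismatches + 1

    # Hash table keyed by (block position, block content) -> read ids.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos, piece in enumerate(blocks(read, n_blocks)):
            index[(pos, piece)].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for cid, center in enumerate(reads):
        if assigned[cid]:
            continue
        assigned[cid] = True
        members = [cid]
        # Candidates share at least one exact block with the center.
        for pos, piece in enumerate(blocks(center, n_blocks)):
            for rid in index[(pos, piece)]:
                if assigned[rid] or len(reads[rid]) != len(center):
                    continue
                # Verify the candidate with a full comparison.
                if hamming(reads[rid], center) <= max_mismatches:
                    assigned[rid] = True
                    members.append(rid)
        clusters.append(members)
    return clusters
```

For example, `cluster_reads(["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC"])` groups the first two reads (one mismatch apart) and leaves the third in its own cluster. The hash lookup restricts the expensive full comparison to reads that share an exact block with the center, which is the general reason seed-based indexing scales better than all-against-all comparison.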
Pages: 2502-2509
Page count: 8