Fast embedding methods for clustering tens of thousands of sequences

被引:4
作者
Blackshields, Gordon [1 ]
Larkin, Mark [1 ]
Wallace, Iain M. [1 ]
Wilm, Andreas [1 ]
Higgins, Desmond G. [1 ]
机构
[1] Univ Coll Dublin, UCD Conway Inst Biomol & Biomed Sci, Dublin 4, Ireland
关键词
clustering; edit distance; multiple sequence alignment; data embedding;
D O I
10.1016/j.compbiolchem.2008.03.005
中图分类号
Q [生物科学];
学科分类号
07 ; 0710 ; 09 ;
摘要
Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires computer memory and time proportional to N-2 for N sequences. For small N or say up to 10000 or so, this can be accomplished in reasonable times for sequences of moderate length. For very large N, however, this becomes increasingly prohibitive. In this paper, we have tested variations on a class of published embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignments. Source code is available on request from the authors. (c) 2008 Elsevier Ltd. All rights reserved.
引用
收藏
页码:282 / 286
页数:5
相关论文
共 22 条
  • [1] ON LIPSCHITZ EMBEDDING OF FINITE METRIC-SPACES IN HILBERT-SPACE
    BOURGAIN, J
    [J]. ISRAEL JOURNAL OF MATHEMATICS, 1985, 52 (1-2) : 46 - 52
  • [2] Dongen Svan, 2000, GRAPH CLUSTERING FLO
  • [3] An efficient algorithm for large-scale detection of protein families
    Enright, AJ
    Van Dongen, S
    Ouzounis, CA
    [J]. NUCLEIC ACIDS RESEARCH, 2002, 30 (07) : 1575 - 1584
  • [4] Faloutsos C., 1995, P 1995 ACM SIGMOD IN, P163
  • [5] PROGRESSIVE SEQUENCE ALIGNMENT AS A PREREQUISITE TO CORRECT PHYLOGENETIC TREES
    FENG, DF
    DOOLITTLE, RF
    [J]. JOURNAL OF MOLECULAR EVOLUTION, 1987, 25 (04) : 351 - 360
  • [6] Pfam:: clans, web tools and services
    Finn, Robert D.
    Mistry, Jaina
    Schuster-Bockler, Benjamin
    Griffiths-Jones, Sam
    Hollich, Volker
    Lassmann, Timo
    Moxon, Simon
    Marshall, Mhairi
    Khanna, Ajay
    Durbin, Richard
    Eddy, Sean R.
    Sonnhammer, Erik L. L.
    Bateman, Alex
    [J]. NUCLEIC ACIDS RESEARCH, 2006, 34 : D247 - D251
  • [7] Fritzke B., 1995, ADV NEURAL INFORMATI, V7, P625
  • [8] Rfam: annotating non-coding RNAs in complete genomes
    Griffiths-Jones, S
    Moxon, S
    Marshall, M
    Khanna, A
    Eddy, SR
    Bateman, A
    [J]. NUCLEIC ACIDS RESEARCH, 2005, 33 : D121 - D124
  • [9] Properties of embedding methods for similarity searching in metric spaces
    Hjaltason, GR
    Samet, H
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2003, 25 (05) : 530 - 549
  • [10] HRISTESCU G, 1999, CLUSTER PRESERVING E