Sequence embedding for fast construction of guide trees for multiple sequence alignment

Cited by: 70
Authors
Blackshields, Gordon [1]
Sievers, Fabian [1]
Shi, Weifeng [1]
Wilm, Andreas [1]
Higgins, Desmond G. [1]
Affiliation
[1] Univ Coll Dublin, UCD Conway Inst Biomol & Biomed Sci, Dublin 4, Ireland
Funding
Science Foundation Ireland;
Keywords
CLUSTAL-W; DATABASE; MAFFT; ACID;
DOI
10.1186/1748-7188-5-21
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N^2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
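The embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmer_distance`, `embed`, `euclidean`), the k-mer Jaccard distance, and the random choice of reference sequences are all assumptions made for illustration. Each sequence is represented by its distances to a small set of reference sequences, so only N x n_refs expensive distances are computed instead of roughly N^2 / 2.

```python
import random

def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """Toy distance: 1 - Jaccard similarity of k-mer sets.
    This is a stand-in (assumption), not the measure used in the paper."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def embed(seqs, n_refs, seed=0):
    """Map each sequence to a vector of distances to n_refs reference sequences.
    References are picked at random here (assumption); this costs
    len(seqs) * n_refs distance computations."""
    rng = random.Random(seed)
    refs = rng.sample(seqs, n_refs)
    return [[kmer_distance(s, r) for r in refs] for s in seqs]

def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

seqs = ["ACGTACGTAC", "ACGTACGTTC", "TTGACCGATA", "TTGACCGTTA", "GGGCCCAAAT"]
vectors = embed(seqs, n_refs=2)

# Distances between embedded vectors can now feed any clustering method
# without ever building the full sequence-to-sequence distance matrix.
approx = euclidean(vectors[0], vectors[1])
```

With N sequences and a fixed number of references, the embedding stores N short vectors rather than an N x N matrix; a clustering step would then operate on these vectors to produce a guide tree.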
Pages: 11