Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory

被引:10
作者
Bull, Simon C. [1 ]
Muldoon, Mark R. [2 ]
Doig, Andrew J. [1 ]
机构
[1] Univ Manchester, Manchester Inst Biotechnol, Fac Life Sci, Manchester, Lancs, England
[2] Univ Manchester, Sch Math, Manchester, Lancs, England
来源
PLOS ONE | 2013年 / 8卷 / 02期
基金
英国生物技术与生命科学研究理事会;
关键词
MAXIMUM CLIQUE PROBLEM;
D O I
10.1371/journal.pone.0055484
中图分类号
O [数理科学和化学]; P [天文学、地球科学]; Q [生物科学]; N [自然科学总论];
学科分类号
07 ; 0710 ; 09 ;
摘要
Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by containing similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two pairs have similarities above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible non-redundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes connected by undirected edges, if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), where as few nodes are removed as possible to remove all edges. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOLSIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding clique members that have no edges to nodes outside the clique to the MIS, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS and is fast enough to apply to data sets of tens of thousands of proteins. Leaf gives sets that are around 10% larger than the commonly used PISCES algorithm, that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating non-redundant protein data sets, though it is ineffective on dense graphs, such as the BHOLSIB benchmarks. The Leaf algorithm is available at: https://github.com/SimonCB765/Leaf, and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/.
引用
收藏
页数:12
相关论文
共 11 条
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [3] Balaji S., 2010, Adv. Model. Optim., V12, P107
  • [4] Barton G., 1996, PROTEIN STRUCTURE PR
  • [5] Simple ingredients leading to very efficient heuristics for the maximum clique problem
    Grosso, Andrea
    Locatelli, Marco
    Pullan, Wayne
    [J]. JOURNAL OF HEURISTICS, 2008, 14 (06) : 587 - 612
  • [6] HOBOHM U, 1994, PROTEIN SCI, V3, P522
  • [7] Liu P, 2009, BIOINFORMATICS BIOME, P1
  • [8] Niskanen S., 2003, CLIQUER USERS GUIDE
  • [9] A fast algorithm for the maximum clique problem
    Östergård, PRJ
    [J]. DISCRETE APPLIED MATHEMATICS, 2002, 120 (1-3) : 197 - 207
  • [10] Distributed computing in practice: the Condor experience
    Thain, D
    Tannenbaum, T
    Livny, M
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2005, 17 (2-4) : 323 - 356