A fast hierarchical clustering algorithm for large-scale protein sequence data sets

Cited by: 16
Authors
Szilagyi, Sandor M. [1]
Szilagyi, Laszlo [2, 3]
Affiliations
[1] Petru Maior Univ, Dept Informat, Targu Mures 540088, Romania
[2] Budapest Univ Technol & Econ, Dept Control Engn & Informat Technol, H-1117 Budapest, Hungary
[3] Sapientia Univ Transylvania, Fac Tech & Human Sci, Targu Mures 540485, Romania
Keywords
Protein sequence clustering; Markov clustering; Markov processes; Efficient computing; Sparse matrix; CLASSIFICATION; SEARCH; SCOP;
DOI
10.1016/j.compbiomed.2014.02.016
Chinese Library Classification number
Q [Biological Sciences]
Subject classification code
07; 0710; 09
Abstract
TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to speed up the time-consuming expansion. The proposed algorithm was tested on protein sequence databases such as SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude compared to recently published efficient solutions, reducing the total runtime to well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without any loss of partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values. (C) 2014 Elsevier Ltd. All rights reserved.
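The main loop described in the abstract (expansion, inflation, normalization) can be illustrated with a minimal dense NumPy sketch. This is not the authors' implementation: it uses neither the modified sparse matrix structure nor the symmetric squaring formula from the paper. The function names, the inflation value of 2.0, the convergence tolerance, and the connected-components readout of the final matrix are all assumptions made for illustration only.

```python
import numpy as np

def normalize_columns(m):
    """Rescale every column to sum to 1 so the matrix stays column-stochastic."""
    col_sums = m.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0  # avoid division by zero for empty columns
    return m / col_sums

def clusters_from_matrix(m, threshold=1e-6):
    """Assumed readout: connected components of the surviving edges."""
    adjacency = (m > threshold) | (m.T > threshold)
    seen, clusters = set(), []
    for start in range(m.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in np.flatnonzero(adjacency[node]):
                neighbor = int(neighbor)
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters

def markov_cluster(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Dense Markov clustering sketch (hypothetical helper, not the paper's code)."""
    m = normalize_columns(np.asarray(similarity, dtype=float))
    for _ in range(max_iter):
        expanded = m @ m                          # expansion: matrix squaring
        inflated = np.power(expanded, inflation)  # inflation: elementwise power
        updated = normalize_columns(inflated)     # restore the probabilistic constraint
        converged = np.abs(updated - m).max() < tol
        m = updated
        if converged:
            break
    return clusters_from_matrix(m)

# Toy similarity matrix with two disconnected groups: {0, 1, 2} and {3, 4}.
demo_similarity = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)
demo_clusters = markov_cluster(demo_similarity)
```

The dense product `m @ m` costs cubic time in the number of sequences, which is the step the paper targets with its sparse representation and symmetry-based squaring; the sketch only shows the order of the operations.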
Pages: 94-101
Page count: 8
Related papers
50 in total
  • [1] Parallel Clustering Algorithm for Large-Scale Biological Data Sets
    Wang, Minchao
    Zhang, Wu
    Ding, Wang
    Dai, Dongbo
    Zhang, Huiran
    Xie, Hao
    Chen, Luonan
    Guo, Yike
    Xie, Jiang
    PLOS ONE, 2014, 9 (04):
  • [2] An Improved Affinity Propagation Clustering Algorithm for Large-scale Data Sets
    Liu, Xiaonan
    Yin, Meijuan
    Luo, Junyong
    Chen, Wuping
    2013 NINTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION (ICNC), 2013, : 894 - 899
  • [3] HGC: fast hierarchical clustering for large-scale single-cell data
    Zou, Ziheng
    Hua, Kui
    Zhang, Xuegong
    BIOINFORMATICS, 2021, 37 (21) : 3964 - 3965
  • [4] Fast spectral clustering learning with hierarchical bipartite graph for large-scale data
    Yang, Xiaojun
    Yu, Weizhong
    Wang, Rong
    Zhang, Guohao
    Nie, Feiping
    PATTERN RECOGNITION LETTERS, 2020, 130 : 345 - 352
  • [5] A fast algorithm for learning a ranking function from large-scale data sets
    Raykar, Vikas C.
    Duraiswami, Ramani
    Krishnapuram, Balaji
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2008, 30 (07) : 1158 - 1170
  • [6] Fast algorithm for large-scale subspace clustering by LRR
    Xie, Deyan
    Nie, Feiping
    Gao, Quanxue
    Xiao, Song
    IET IMAGE PROCESSING, 2020, 14 (08) : 1475 - 1480
  • [7] A fast fuzzy clustering algorithm for large-scale datasets
    Shi, LK
    He, PL
    ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2005, 3584 : 203 - 208
  • [8] Privacy-preserving constrained spectral clustering algorithm for large-scale data sets
    Li, Ji
    Wei, Jianghong
    Ye, Mao
    Liu, Wenfen
    Hu, Xuexian
    IET INFORMATION SECURITY, 2020, 14 (03) : 321 - 331
  • [9] Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis
    Mao, Qi
    Zheng, Wei
    Wang, Li
    Cai, Yunpeng
    Mai, Volker
    Sun, Yijun
    2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2015, : 310 - 319
  • [10] fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data
    Hung, Ling-Hong
    Samudrala, Ram
    BIOINFORMATICS, 2014, 30 (12) : 1774 - 1776