A fast hierarchical clustering algorithm for large-scale protein sequence data sets

Cited by: 16

Authors:

Szilagyi, Sandor M. [1]

Szilagyi, Laszlo [2,3]

Affiliations:

[1] Petru Major Univ, Dept Informat, Targu Mures 540088, Romania

[2] Budapest Univ Technol & Econ, Dept Control Engn & Informat Technol, H-1117 Budapest, Hungary

[3] Sapientia Univ Transylvania, Fac Tech & Human Sci, Targu Mures 540485, Romania

Source:

COMPUTERS IN BIOLOGY AND MEDICINE | 2014 / Vol. 48

Keywords:

Protein sequence clustering; Markov clustering; Markov processes; Efficient computing; Sparse matrix; CLASSIFICATION; SEARCH; SCOP;

DOI:

10.1016/j.compbiomed.2014.02.016

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to facilitate the time-consuming expansion. The proposed algorithm was tested on protein sequence databases such as SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude, compared to recently published efficient solutions, reducing the total runtime well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without any loss of partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values. (C) 2014 Elsevier Ltd. All rights reserved.
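
The main loop described in the abstract (expansion, inflation, and per-iteration normalization) can be illustrated with a minimal dense sketch. This is not the authors' implementation: it uses plain Python lists instead of the modified sparse matrix structure, an ordinary matrix product instead of the symmetric fast squaring formula, and all names (`normalize_columns`, `expand`, `inflate`, `mcl`, `clusters`) as well as the default inflation rate of 2.0, the tolerance, and the iteration cap are assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 so the matrix stays column-stochastic."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                out[i][j] = m[i][j] / s
    return out


def expand(m):
    """Expansion: square the matrix (plain dense product, no symmetry shortcut)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Alternate expansion and inflation until the matrix stops changing.

    inflation, max_iter and tol are illustrative defaults, not values
    recommended by the paper.
    """
    m = normalize_columns(similarity)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        diff = max(abs(a - b) for ra, rb in zip(nxt, m) for a, b in zip(ra, rb))
        m = nxt
        if diff < tol:
            break
    return m


def clusters(m, eps=1e-6):
    """Read clusters off the converged matrix: the columns with mass in a
    nonzero row form one cluster; identical rows yield the same cluster."""
    found = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)


# Toy similarity matrix: two disconnected groups, {0, 1, 2} and {3, 4}.
sim = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(clusters(mcl(sim)))  # [[0, 1, 2], [3, 4]]
```

The dense product in `expand` costs O(n^3) per iteration, which is the step the paper targets with its sparse structure and symmetric squaring formula; the sketch only shows what each operation computes, not how to make it fast.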

Pages: 94-101

Page count: 8

