Shrinkage Clustering: a fast and size-constrained clustering algorithm for biomedical applications

Cited by: 13
Authors
Hu, Chenyue W. [1]
Li, Hanyang [1]
Qutub, Amina A. [1]
Affiliations
[1] Rice Univ, Dept Bioengn, Main St, Houston, TX 77030 USA
Source
BMC BIOINFORMATICS | 2018 / Vol. 19
Funding
National Science Foundation (USA)
Keywords
Clustering; Matrix factorization; Cancer subtyping; Gene expression; GENE-EXPRESSION DATA; BREAST-CANCER; IDENTIFICATION; VALIDATION; DISCOVERY; PROGNOSIS; SUBTYPES;
DOI
10.1186/s12859-018-2022-8
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Many common clustering algorithms require a two-step process that limits their efficiency: the algorithm must be run repeatedly, and it must be paired with a model selection criterion. Both steps are needed to determine the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets increase in size and prevalence, there is a growing need for new methods that are more convenient to implement and more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size to make the clustering result meaningful and interpretable for subsequent analysis.
Results: We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performance across multiple simulated and actual datasets, and demonstrate its accuracy and speed when applied to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints.
Conclusions: Given its ease of implementation, computing efficiency and extensible structure, Shrinkage Clustering can be applied broadly to solve biomedical clustering tasks, especially when dealing with large datasets.
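The idea of finding the number of clusters while partitioning the data can be illustrated with a minimal sketch. This is not the authors' published procedure. It assumes a symmetric similarity matrix `S`, a binary assignment matrix `A`, and the objective of minimizing the squared Frobenius norm of `S - A A^T`; none of these details are stated in the abstract. The function name `shrinkage_style_clustering` and the parameters `k0`, `max_sweeps`, and `seed` are hypothetical. The sketch starts from more clusters than needed, greedily moves samples to the cluster that lowers the objective, and discards clusters once they become empty.

```python
import numpy as np


def shrinkage_style_clustering(S, k0, max_sweeps=1000, seed=0):
    """Greedy sketch (assumed objective: minimize ||S - A A^T||_F^2).

    S  : (n, n) symmetric similarity matrix.
    k0 : initial number of clusters, chosen larger than expected.
    Returns integer labels 0..K-1, where K <= k0 is found by the procedure.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    # Random start in which every one of the k0 clusters is non-empty.
    labels = rng.permutation(np.arange(n) % k0)
    # Placing sample i with sample j changes the objective in
    # proportion to (1 - 2 * S[i, j]).
    S_tilde = 1.0 - 2.0 * S

    for _ in range(max_sweeps):
        # Shrinkage step: drop empty clusters by relabelling to 0..K-1.
        _, labels = np.unique(labels, return_inverse=True)
        k = int(labels.max()) + 1
        A = np.zeros((n, k))
        A[np.arange(n), labels] = 1.0
        counts = A.sum(axis=0)

        moved = False
        for i in range(n):
            cur = labels[i]
            costs = S_tilde[i] @ A           # cost of joining each cluster
            costs[cur] -= S_tilde[i, i]      # exclude the sample itself
            costs[counts == 0] = np.inf      # emptied clusters stay closed
            best = int(np.argmin(costs))
            if costs[best] < costs[cur]:     # move only on strict improvement
                A[i, cur], A[i, best] = 0.0, 1.0
                counts[cur] -= 1
                counts[best] += 1
                labels[i] = best
                moved = True
        if not moved:
            break

    _, labels = np.unique(labels, return_inverse=True)
    return labels


if __name__ == "__main__":
    # Toy similarity: two groups of 10 samples, similar only within a group.
    S = np.zeros((20, 20))
    S[:10, :10] = 1.0
    S[10:, 10:] = 1.0
    print(shrinkage_style_clustering(S, k0=5))
```

Because each accepted move strictly lowers the assumed objective, the loop terminates on such toy inputs without a separate model selection step. Cluster size constraints, which the abstract mentions, are not covered by this sketch.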
Pages: 11