Sparse Convex Clustering

Cited by: 43
Authors
Wang, Binhuan [1]
Zhang, Yilong [2]
Sun, Will Wei [3]
Fang, Yixin [4]
Affiliations
[1] NYU, Sch Med, Dept Populat Hlth, 650 First Ave Rm 578, New York, NY 10016 USA
[2] Merck Res Labs, Rahway, NJ USA
[3] Univ Miami, Sch Business Adm, Dept Management Sci, Coral Gables, FL 33124 USA
[4] New Jersey Inst Technol, Dept Math Sci, Newark, NJ 07102 USA
Keywords
Convex clustering; Finite sample error; Group LASSO; High-dimensionality; Sparsity; VARIABLE SELECTION; LASSO; REGRESSION; ALGORITHM; NUMBER
DOI
10.1080/10618600.2017.1377081
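Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714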
Abstract
Convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn attention recently because it addresses the instability of traditional nonconvex clustering methods. Although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scenario, where the data contain a large number of features and many of them carry no information about the clustering structure. In this article, we demonstrate that the performance of convex clustering can be distorted when uninformative features are included in the clustering. To overcome this issue, we introduce a new clustering method, referred to as Sparse Convex Clustering, to simultaneously cluster observations and conduct feature selection. The key idea is to formulate convex clustering as a regularization problem, with an adaptive group-lasso penalty term on cluster centers. To optimally balance the trade-off between cluster fitting and sparsity, a tuning criterion based on clustering stability is developed. Theoretically, we obtain a finite sample error bound for our estimator and further establish its variable selection consistency. The effectiveness of the proposed method is examined through a variety of numerical experiments and a real data application. Supplementary material for this article is available online.
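The regularization form described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function name, the argument names, the use of the Euclidean norm for the fusion term, and the default unit weights are all hypothetical. The sketch only evaluates an objective made of a squared-error fit term, a pairwise fusion penalty on the rows of the center matrix `A` (the convex clustering part), and a weighted group-lasso penalty on the columns of `A` (the feature-selection part). It does not solve the optimization problem.

```python
import numpy as np


def sparse_convex_clustering_objective(X, A, gamma1, gamma2,
                                       pair_weights=None,
                                       feature_weights=None):
    """Evaluate an assumed sparse convex clustering objective.

    X: (n, p) data matrix; A: (n, p) matrix of per-observation centers.
    gamma1 scales the fusion penalty, gamma2 the group-lasso penalty.
    pair_weights: optional (n, n) array of fusion weights (assumed 1).
    feature_weights: optional length-p adaptive weights (assumed 1).
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    n, p = X.shape

    # Fit term: half the squared Frobenius distance between data and centers.
    fit = 0.5 * np.sum((X - A) ** 2)

    # Fusion term: weighted Euclidean distance between every pair of centers.
    fusion = 0.0
    for i in range(n):
        for k in range(i + 1, n):
            w = 1.0 if pair_weights is None else pair_weights[i, k]
            fusion += w * np.linalg.norm(A[i] - A[k])

    # Group-lasso term: weighted Euclidean norm of each feature column of A.
    # A column that is entirely zero corresponds to a feature that is dropped.
    u = np.ones(p) if feature_weights is None else np.asarray(feature_weights, dtype=float)
    sparsity = np.sum(u * np.linalg.norm(A, axis=0))

    return fit + gamma1 * fusion + gamma2 * sparsity
```

In this sketch, a larger `gamma1` pushes rows of `A` to coincide (fewer clusters), while a larger `gamma2` pushes whole columns of `A` to zero (fewer selected features), which is the trade-off the abstract says is tuned by a stability-based criterion.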
Pages: 393-403
Number of pages: 11
Related papers
31 in total
  • [1] Alelyani S, 2014, CH CRC DATA MIN KNOW, P29
  • [2] [Anonymous], FOUND TRENDS MACH LE
  • [3] [Anonymous], 2006, Journal of the Royal Statistical Society, Series B
  • [4] [Anonymous], 2010, ARXIV
  • [5] Convex Biclustering
    Chi, Eric C.
    Allen, Genevera I.
    Baraniuk, Richard G.
    [J]. BIOMETRICS, 2017, 73 (01) : 10 - 19
  • [6] Splitting Methods for Convex Clustering
    Chi, Eric C.
    Lange, Kenneth
    [J]. JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2015, 24 (04) : 994 - 1013
  • [7] Selection of the number of clusters via the bootstrap method
    Fang, Yixin
    Wang, Junhui
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2012, 56 (03) : 468 - 477
  • [8] Gabay D., 1976, Computers & Mathematics with Applications, V2, P17, DOI 10.1016/0898-1221(76)90003-1
  • [9] GLOWINSKI R, 1975, REV FR AUTOMAT INFOR, V9, P41
  • [10] Pairwise Variable Selection for High-Dimensional Model-Based Clustering
    Guo, Jian
    Levina, Elizaveta
    Michailidis, George
    Zhu, Ji
    [J]. BIOMETRICS, 2010, 66 (03) : 793 - 804