Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data

Cited by: 204
Authors
Liu, Yufeng [1]
Hayes, David Neil [2 ]
Nobel, Andrew
Marron, J. S. [2 ]
Affiliations
[1] Univ N Carolina, Carolina Ctr Genome Sci, Dept Stat & Operat Res, Chapel Hill, NC 27599 USA
[2] Univ N Carolina, Lineberger Comprehens Canc Ctr, Chapel Hill, NC 27599 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA);
Keywords
Clustering; High-dimension low-sample data; k-means; Microarray gene expression data; p value; Statistical significance;
DOI
10.1198/016214508000000454
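Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes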
020208 ; 070103 ; 0714 ;
Abstract
Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low-sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there", as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.
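
The testing idea in the abstract can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not taken from this record: the test statistic (a 2-means "cluster index", the within-cluster sum of squares divided by the total sum of squares), the plain sample-eigenvalue estimate of the null covariance (used here in place of the factor analysis model the abstract mentions), and the hypothetical function names `two_means_cluster_index` and `sigclust_sketch`. It should be read as a simplified Monte Carlo test against a single-Gaussian null, not as the authors' implementation.

```python
import numpy as np


def two_means_cluster_index(x, rng, n_starts=5, n_iter=50):
    """Smallest 2-means cluster index found: within-cluster SS / total SS."""
    total_ss = np.sum((x - x.mean(axis=0)) ** 2)
    best = 1.0
    for _ in range(n_starts):
        # Initialise the two centres at distinct, randomly chosen data points.
        centres = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster became empty
                break
            new_centres = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum() / total_ss)
    return best


def sigclust_sketch(x, n_sim=100, seed=0):
    """Empirical p value for the 2-means cluster index under one Gaussian.

    Assumption: the null covariance is diagonal with the sample eigenvalues
    on the diagonal. The index used here does not change under shifts and
    rotations of the data, so only the eigenvalues matter for the simulation.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    observed = two_means_cluster_index(x, rng)
    # Singular values of the centred data give the sample eigenvalues.
    s = np.linalg.svd(x - x.mean(axis=0), compute_uv=False)
    std = s / np.sqrt(n - 1)
    null = np.array([
        two_means_cluster_index(rng.standard_normal((n, std.size)) * std, rng)
        for _ in range(n_sim)
    ])
    # A small index means tight clusters, so the lower tail is the extreme one.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return observed, p_value


# Toy HDLSS-style data: 40 samples, 50 dimensions, two shifted groups.
data_rng = np.random.default_rng(1)
separated = np.vstack([
    data_rng.standard_normal((20, 50)) - 3.0,
    data_rng.standard_normal((20, 50)) + 3.0,
])
ci, p = sigclust_sketch(separated)
print(f"cluster index = {ci:.3f}, empirical p value = {p:.4f}")
```

A small empirical p value here suggests that the observed split is tighter than a single Gaussian with the same eigenvalues would typically produce. Raw sample eigenvalues are a crude choice when the dimension exceeds the sample size, which is presumably why the abstract points to a dedicated HDLSS covariance estimate.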
Pages: 1281-1293
Number of pages: 13