Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data

Cited by: 204

Authors:

Liu, Yufeng [1]

Hayes, David Neil [2]

Nobel, Andrew

Marron, J. S. [2]

Affiliations:

[1] Univ N Carolina, Carolina Ctr Genome Sci, Dept Stat & Operat Res, Chapel Hill, NC 27599 USA

[2] Univ N Carolina, Lineberger Comprehens Canc Ctr, Chapel Hill, NC 27599 USA

Source:

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION | 2008 / Vol. 103 / No. 483

Funding:

US National Institutes of Health; US National Science Foundation

Keywords:

Clustering; High-dimension low-sample data; k-means; Microarray gene expression data; p value; Statistical significance;

DOI:

10.1198/016214508000000454

Chinese Library Classification (CLC):

O21 [Probability theory and mathematical statistics]; C8 [Statistics]

Discipline classification codes:

020208; 070103; 0714

Abstract:

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low-sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there", as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.
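
The testing idea in the abstract can be illustrated with a small Monte Carlo sketch. It is not the SigClust implementation: the test statistic (a 2-means cluster index, within-cluster sum of squares divided by total sum of squares) is an assumption of this sketch, and the null covariance is taken to be diagonal with per-coordinate sample variances, a deliberate simplification that stands in for the invariance and factor-model estimation the abstract mentions. All function names are hypothetical.

```python
import random

def mean_vec(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_index(points, n_iter=25):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    center = mean_vec(points)
    total_ss = sum(sq_dist(p, center) for p in points)
    # Deterministic start: the point farthest from the mean, then the point farthest from that one.
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centers = [c0, c1]
    for _ in range(n_iter):
        groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # keep the previous centers if a cluster empties
        centers = [mean_vec(g) for g in groups]
    # Each point contributes its squared distance to the nearest final center.
    within_ss = sum(min(sq_dist(p, centers[0]), sq_dist(p, centers[1])) for p in points)
    return within_ss / total_ss

def gaussian_null_p_value(points, n_sim=200, seed=0):
    """Empirical p value of the observed cluster index under a single-Gaussian null."""
    rng = random.Random(seed)
    n = len(points)
    center = mean_vec(points)
    # ASSUMPTION: diagonal null covariance from per-coordinate sample variances,
    # not the covariance estimation described in the abstract.
    sds = [(sum((x - m) ** 2 for x in col) / (n - 1)) ** 0.5
           for col, m in zip(zip(*points), center)]
    observed = cluster_index(points)
    hits = 0
    for _ in range(n_sim):
        sim = [[rng.gauss(m, s) for m, s in zip(center, sds)] for _ in range(n)]
        if cluster_index(sim) <= observed:
            hits += 1
    return observed, hits / n_sim

# Usage: two well-separated groups of 10 points in 5 dimensions (synthetic data).
demo_rng = random.Random(1)
data = [[demo_rng.gauss(shift, 0.5) for _ in range(5)]
        for shift in (0.0, 10.0) for _ in range(10)]
observed_ci, p_value = gaussian_null_p_value(data, n_sim=100)
print(observed_ci, p_value)
```

A small p value here means that data drawn from one Gaussian rarely split into two groups as cleanly as the observed data do, which is the sense in which the Gaussian null quantifies the significance of a clustering.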

Pages: 1281-1293

Page count: 13
