A NEW INFORMATION CRITERION BASED ON LANGEVIN MIXTURE DISTRIBUTION FOR CLUSTERING CIRCULAR DATA WITH APPLICATION TO TIME COURSE GENOMIC DATA

Cited by: 2
Authors
Qiu, Xing [1]
Wu, Shuang [1]
Wu, Hulin [1]
Affiliations
[1] Univ Rochester, Dept Biostat & Computat Biol, Rochester, NY 14642 USA
Funding
National Institutes of Health (USA)
Keywords
Circular statistics; clustering; information criterion; Langevin distribution; mixture model; model selection
DOI
10.5705/ss.2013.030
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Common pre-processing procedures for time course microarray analysis, such as standardization and gene filtering based on the functional F-test, often result in directional data that lie on a sphere S^{d-1}. While there have been some efforts in designing spherical clustering algorithms, few researchers have developed methods for selecting the number of clusters in spherical cluster analysis. In this paper, we focus on circular data on S^1 and propose a novel information-based criterion, ICCC (information criterion for circular clustering), to determine the number of clusters when clustering circular data. ICCC is based on a finite mixture model of Langevin distributions and is derived from the asymptotic properties of the maximum likelihood of the Langevin mixture distribution. Through the study of both simulated data and a large set of time course microarray data, we demonstrate that ICCC provides better estimates of the number of clusters than the existing methods considered: AIC, BIC, the Gap criterion, and the Maitra-Ramler criterion.
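To make the modelling setup in the abstract concrete, the following is a minimal sketch of a finite mixture of Langevin distributions on the circle S^1, where the Langevin distribution coincides with the von Mises distribution, together with the standard BIC, one of the baselines the abstract names. It does not implement ICCC, whose formula is not given in this record. The function names, the series-based Bessel evaluation, and the parameter count of 3K - 1 (K mean directions, K concentrations, K - 1 free weights) are assumptions made for illustration.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function I_0(x) from its power series.

    Adequate for moderate concentrations; an assumption of this sketch,
    not a numerically robust implementation for very large x.
    """
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) ** 2 / (k * k)  # ratio of consecutive series terms
        total += term
    return total

def von_mises_pdf(theta, mu, kappa):
    """Langevin (von Mises) density on the circle, with angles in radians."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_loglik(thetas, weights, mus, kappas):
    """Log-likelihood of angles under a K-component Langevin mixture."""
    loglik = 0.0
    for t in thetas:
        density = sum(w * von_mises_pdf(t, m, k)
                      for w, m, k in zip(weights, mus, kappas))
        loglik += math.log(density)
    return loglik

def bic(loglik, n_components, n_obs):
    """Standard BIC with 3K - 1 free parameters (assumed parameterization)."""
    n_params = 3 * n_components - 1
    return -2.0 * loglik + n_params * math.log(n_obs)
```

In use, one would fit the mixture for each candidate number of clusters K (for example by EM, which is not shown here), evaluate `mixture_loglik` at the fitted parameters, and pick the K that minimizes the criterion. ICCC follows the same selection pattern but, per the abstract, uses a penalty derived from the asymptotics of the Langevin mixture likelihood.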
Pages: 1459-1476
Page count: 18