DATA SPECTROSCOPY: EIGENSPACES OF CONVOLUTION OPERATORS AND CLUSTERING

Cited by: 48
Authors
Shi, Tao [1]
Belkin, Mikhail [2]
Yu, Bin [3]
Affiliations
[1] Ohio State Univ, Dept Stat, Columbus, OH 43210 USA
[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[3] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
Funding
U.S. National Science Foundation
Keywords
Gaussian kernel; spectral clustering; kernel principal component analysis; support vector machines; unsupervised learning; image segmentation
DOI
10.1214/09-AOS700
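Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract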
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm, which uses properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and kernel principal component analysis, and provide a new understanding of their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
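The abstract's idea of selecting eigenvectors of a kernel adjacency matrix, rather than taking a fixed number of top ones, can be illustrated with a minimal sketch. It is not the DaSpec algorithm as published: the selection rule used here (keep eigenvectors whose entries do not change sign beyond a small tolerance, then assign each point to the kept eigenvector with the largest absolute entry) and every name and parameter (`gaussian_kernel_matrix`, `sign_constant_clusters`, `bandwidth`, `tol`, `max_vectors`) are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel_matrix(X, bandwidth):
    """Gaussian (radial) kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def sign_constant_clusters(X, bandwidth, tol=1e-3, max_vectors=10):
    """Illustrative rule (an assumption, not the published procedure).

    Among the leading `max_vectors` eigenvectors of K / n, keep those whose
    entries do not change sign once entries smaller than tol * max|v| are
    ignored. The number kept is the estimated number of clusters, and each
    point is labelled by the kept eigenvector with the largest absolute entry.
    """
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, bandwidth)
    _, vecs = np.linalg.eigh(K / n)          # eigenvalues come back in ascending order
    leading = vecs[:, ::-1][:, :max_vectors]  # reorder so the top eigenvectors come first

    kept = []
    for j in range(leading.shape[1]):
        v = leading[:, j]
        threshold = tol * np.max(np.abs(v))
        has_positive = np.any(v > threshold)
        has_negative = np.any(v < -threshold)
        if not (has_positive and has_negative):
            kept.append(np.abs(v))

    if not kept:
        raise ValueError("no sign-constant eigenvector found; try another bandwidth or tol")

    scores = np.stack(kept, axis=1)
    labels = np.argmax(scores, axis=1)
    return labels, len(kept)
```

Calling `sign_constant_clusters(X, bandwidth=1.0)` on an `(n, d)` array returns one integer label per row together with the estimated number of groups. The result depends strongly on `bandwidth` and `tol`, which this sketch leaves to the caller.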
Pages: 3960-3984
Page count: 25