A Hierarchical Algorithm for Extreme Clustering

被引:56
作者
Kobren, Ari [1 ]
Monath, Nicholas [1 ]
Krishnamurthy, Akshay [1 ]
McCallum, Andrew [1 ]
机构
[1] Univ Massachusetts Amherst, Coll Informat & Comp Sci, Amherst, MA 01002 USA
来源
KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2017年
基金
美国国家科学基金会;
关键词
Clustering; Large-scale learning;
D O I
10.1145/3097983.3098079
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Many modern clustering methods scale well to a large number of data points, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy, incremental algorithm for hierarchical clustering that scales to both massive N and K-a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.
引用
收藏
页码:255 / 264
页数:10
相关论文
共 41 条
[1]  
[Anonymous], S DISCR ALG
[2]  
[Anonymous], INT C MACH LEARN
[3]  
[Anonymous], NUCL ACIDS RES
[4]  
[Anonymous], 1996, P 2 INT C KNOWL DISC
[5]  
[Anonymous], 1996, ACM SIGMOD INT C MAN
[6]  
[Anonymous], 2013, J MACHINE LEARNING R
[7]  
[Anonymous], J EXPT ALGORITHMICS
[8]  
[Anonymous], ADV NEURAL INFORM PR
[9]  
[Anonymous], 2009, P CVPR
[10]  
[Anonymous], FDN COMPUTER SCI