A Hierarchical Algorithm for Extreme Clustering

被引:56
作者
Kobren, Ari [1 ]
Monath, Nicholas [1 ]
Krishnamurthy, Akshay [1 ]
McCallum, Andrew [1 ]
机构
[1] Univ Massachusetts Amherst, Coll Informat & Comp Sci, Amherst, MA 01002 USA
来源
KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2017年
基金
美国国家科学基金会;
关键词
Clustering; Large-scale learning;
D O I
10.1145/3097983.3098079
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Many modern clustering methods scale well to a large number of data points, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy, incremental algorithm for hierarchical clustering that scales to both massive N and K-a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.
引用
收藏
页码:255 / 264
页数:10
相关论文
共 41 条
[31]  
[Anonymous], ACM S THEOR COMP
[32]  
[Anonymous], INT C VER LARG DAT
[33]  
Blundell C., 2011, Mixture Estimation and Applications
[34]  
Clauset A, 2004, PHYS REV E, V70, DOI 10.1103/PhysRevE.70.066111
[35]  
Daume III H., 2017, LEARNING INT C MACHI
[36]  
Eisen M., 1998, CLUSTER ANAL DISPLAY
[37]  
LICHMAN M., 2013, UCI MACHINE LEARNING
[38]  
Omohundro Stephen M, 1989, Five Balltree Construction Algorithms.
[39]  
Pedregosa F, 2011, J MACH LEARN RES, V12, P2825
[40]  
Sedgewick Robert, 1988, Algorithms, Vsecond